The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Below, you will find my analysis for the Titanic challenge created by the Kaggle team. The competition started on September 28, 2012 and runs until December 31, 2016.
To create my own solution for this challenge, I studied the Titanic and its survivors.
Some information can be found on websites such as https://www.encyclopedia-titanica.org/ or http://www.titanicfacts.net/titanic-passengers.html.
After spending some time understanding the situation in which the data were collected, I started digging into the available datasets.
First, I installed all the R packages necessary for the analysis; the packages can also be loaded along the way.
#install.packages("htmlwidgets")
#install_github("easyGgplot2", "kassambara")
#install.packages("devtools")
#library(htmlwidgets)
library('ggplot2')
library('ggthemes')
library('scales')
library('dplyr')
library('mice')
library('randomForest')
library('Hmisc')
library('reshape2')
library('stringr')
library('plyr')
library('gridExtra')
library('devtools')
library('easyGgplot2')
library('vcd')
library('rpart')
library('rattle')
library('rpart.plot')
library('RColorBrewer')
library('caret')
Creating functions during an analysis is always helpful, especially when you have to deal with many repeated actions.
To facilitate reading the datasets, I used a function available on the Internet.
readData <- function(fileName, VariableType, missingNA) {
  read.csv2(fileName, sep = ",", dec = ".",
            colClasses = VariableType,
            na.strings = missingNA)
}
train.data <- "train.csv"
test.data <- "test.csv"
missingNA <- c("NA", "")
train.VariableType <- c('integer', # PassengerId
'numeric', # Survived
'factor', # Pclass
'character', # Name
'factor', # Sex
'numeric', # Age
'integer', # SibSp
'integer', # Parch
'character', # Ticket
'numeric', # Fare
'character', # Cabin
'factor' # Embarked
)
test.VariableType <- train.VariableType[-2] ## There is no "Survived" variable in the test file
dt.train <- readData(train.data, train.VariableType, missingNA)
dt.test <- readData(test.data,test.VariableType, missingNA)
The first step in working with machine learning is to evaluate the training dataset. For this step, I summarize it.
summary(dt.train)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 Min. :0.0000 1:216 Length:891 female:314
## 1st Qu.:223.5 1st Qu.:0.0000 2:184 Class :character male :577
## Median :446.0 Median :0.0000 3:491 Mode :character
## Mean :446.0 Mean :0.3838
## 3rd Qu.:668.5 3rd Qu.:1.0000
## Max. :891.0 Max. :1.0000
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 C :168
## 1st Qu.: 7.91 Class :character Q : 77
## Median : 14.45 Mode :character S :644
## Mean : 32.20 NA's: 2
## 3rd Qu.: 31.00
## Max. :512.33
##
From the summary table, we can see that the variables Age and Embarked have missing data; we will deal with them later.
Here, we label the categorical variables to make them easier to read.
dt.train$Survived <- factor(dt.train$Survived, levels=c(1,0))
levels(dt.train$Survived) <- c("Survived", "Died")
dt.train$Pclass <- as.factor(dt.train$Pclass)
levels(dt.train$Pclass) <- c("1st Class", "2nd Class", "3rd Class")
dt.train$Sex <- factor(dt.train$Sex, levels=c("female", "male"))
levels(dt.train$Sex) <- c("Female", "Male")
mosaicplot(Pclass ~ Sex,
data=dt.train, main="Titanic Training Data\nPassenger Sex by Class",
color=c("#8dd3c7", "#fb8072"), shade=FALSE, xlab="", ylab="",
off=c(0), cex.axis=1.4)
table(dt.train$Pclass,dt.train$Sex)
##
## Female Male
## 1st Class 94 122
## 2nd Class 76 108
## 3rd Class 144 347
round(prop.table(table(dt.train$Pclass,dt.train$Sex),1),3)
##
## Female Male
## 1st Class 0.435 0.565
## 2nd Class 0.413 0.587
## 3rd Class 0.293 0.707
Analysing the figure and the tables above, it is clear that there were more men than women in the Titanic training dataset, especially in the third class.
mosaicplot(Sex ~ Survived,
data=dt.train,
color=c("#8dd3c7", "#fb8072"), shade=FALSE, xlab="", ylab="",
off=c(0), cex.axis=1.4,
main="Titanic Training Data\nPassenger Survival by Sex")
table(dt.train$Sex,dt.train$Survived)
##
## Survived Died
## Female 233 81
## Male 109 468
round(prop.table(table(dt.train$Sex,dt.train$Survived),1),3)
##
## Survived Died
## Female 0.742 0.258
## Male 0.189 0.811
mosaicplot(Pclass ~ Survived,
data=dt.train,
color=c("#8dd3c7", "#fb8072"), shade=FALSE, xlab="", ylab="",
off=c(0), cex.axis=1.4,
main="Titanic Training Data\nPassenger Survival by Class")
table(dt.train$Pclass,dt.train$Survived)
##
## Survived Died
## 1st Class 136 80
## 2nd Class 87 97
## 3rd Class 119 372
round(prop.table(table(dt.train$Pclass,dt.train$Survived),1),3)
##
## Survived Died
## 1st Class 0.630 0.370
## 2nd Class 0.473 0.527
## 3rd Class 0.242 0.758
The graphs and tables above show that the proportion of survivors is higher for females (74% vs. 19% for males) and also among first-class passengers (63%), followed by the second (47%) and third classes (24%).
h<-ggplot(dt.train,aes(x = Pclass, fill = Survived,y = (..count..))) +
geom_bar() + labs(y = "Count")+
labs(title="Titanic Training Data: Survived by Class")
h1<-h+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
p<-ggplot(subset(dt.train, dt.train$Sex=="Female"),aes(x = Pclass, fill = Survived,y = (..count..))) +
geom_bar() + labs(y = "Count")+
labs(title="Female by Class")
p1<-p + scale_y_continuous(limits = c(0, 350))
p2<-p1+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
p3<-p2+scale_color_manual(values=c("#8dd3c7","#fb8072"))
q<-ggplot(subset(dt.train, dt.train$Sex=="Male"),aes(x = Pclass, fill = Survived,y = (..count..))) +
geom_bar() + labs(y = "Count")+
labs(title="Male by Class")
q1<-q + scale_y_continuous(limits = c(0, 350))
q2<-q1+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
q3<-q2+scale_color_manual(values=c("#8dd3c7","#fb8072"))
grid.arrange(h2, ncol=1, nrow =1)
grid.arrange(p2, q2, ncol=1, nrow =2)
Now, let’s analyse the survivors for each class and gender. For females, we observe only a few deaths in the first and second classes, with most of them occurring in the third class (almost 50%). For males, we see a higher proportion of survivors in the first class, but there does not seem to be a clear pattern, as the worst survival proportion is in the second class.
mosaicplot(SibSp ~ Survived,
data=dt.train,
color=c("#8dd3c7", "#fb8072"), shade=FALSE, xlab="", ylab="",
off=c(0), cex.axis=1.4,
main="Titanic Training Data\nPassenger Survival by the Number of Siblings/Spouses Aboard")
table(dt.train$SibSp,dt.train$Survived)
##
## Survived Died
## 0 210 398
## 1 112 97
## 2 13 15
## 3 4 12
## 4 3 15
## 5 0 5
## 8 0 7
round(prop.table(table(dt.train$SibSp,dt.train$Survived),1),3)
##
## Survived Died
## 0 0.345 0.655
## 1 0.536 0.464
## 2 0.464 0.536
## 3 0.250 0.750
## 4 0.167 0.833
## 5 0.000 1.000
## 8 0.000 1.000
The following plots and tables concern the number of siblings/spouses and parents/children aboard the Titanic.
A person accompanied by one or two family members seems to have had a higher chance of survival.
h<-ggplot(dt.train,aes(x=SibSp, fill=Survived, color=Survived)) +
geom_histogram(position="identity", alpha=0.5,bins=10) +
labs(title="Titanic Training Data: \nNumber of Siblings/Spouses Aboard by Variable Survived")
h1<-h+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
q<-ggplot(subset(dt.train, dt.train$Sex=="Female"),aes(x=SibSp, fill=Survived, color=Survived)) +
geom_histogram(position="identity", alpha=0.5,bins=10) +
labs(title="Number of Siblings/Spouses Aboard for Female")
q1<-q+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
q2<-q1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
p<-ggplot(subset(dt.train, dt.train$Sex=="Male"),aes(x=SibSp, fill=Survived, color=Survived)) +
geom_histogram(position="identity", alpha=0.5,bins=10) +
labs(title="Number of Siblings/Spouses Aboard for Male")
p1<-p+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
p2<-p1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
grid.arrange(h2, ncol=1, nrow =1)
grid.arrange(q2, p2, ncol=1, nrow =2)
mosaicplot(Parch ~ Survived,
data=dt.train,
color=c("#8dd3c7", "#fb8072"), shade=FALSE, xlab="", ylab="",
off=c(0), cex.axis=1.4,
main="Titanic Training Data\nPassenger Survival by the Number of Parents/Children Aboard")
table(dt.train$Parch,dt.train$Survived)
##
## Survived Died
## 0 233 445
## 1 65 53
## 2 40 40
## 3 3 2
## 4 0 4
## 5 1 4
## 6 0 1
round(prop.table(table(dt.train$Parch,dt.train$Survived),1),3)
##
## Survived Died
## 0 0.344 0.656
## 1 0.551 0.449
## 2 0.500 0.500
## 3 0.600 0.400
## 4 0.000 1.000
## 5 0.200 0.800
## 6 0.000 1.000
h<-ggplot(dt.train,aes(x=Parch, fill=Survived, color=Survived)) +
geom_histogram(position="identity", alpha=0.5,bins=10) +
labs(title="Titanic Training Data: \nNumber of Parents/Children Aboard by Variable Survived")
h1<-h+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
q<-ggplot(subset(dt.train, dt.train$Sex=="Female"),aes(x=Parch, fill=Survived, color=Survived)) +
geom_histogram(position="identity", alpha=0.5,bins=10) +
labs(title="Number of Parents/Children Aboard for Female")
q1<-q+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
q2<-q1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
p<-ggplot(subset(dt.train, dt.train$Sex=="Male"),aes(x=Parch, fill=Survived, color=Survived)) +
geom_histogram(position="identity", alpha=0.5,bins=10) +
labs(title="Number of Parents/Children Aboard for Male")
p1<-p+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
p2<-p1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
grid.arrange(h2, ncol=1, nrow =1)
grid.arrange(q2, p2, ncol=1, nrow =2)
Most of the passengers embarked at Southampton, followed by Cherbourg and Queenstown. Even so, the highest proportion of survivors embarked at Cherbourg.
dt.train$Embarked[which(is.na(dt.train$Embarked))] <- 'S' # The most common value
mosaicplot(Embarked ~ Survived,
data=dt.train,
color=c("#8dd3c7", "#fb8072"), shade=FALSE, xlab="", ylab="",
off=c(0), cex.axis=1.4,
main="Titanic Training Data\nPassenger Survival by Port of Embarkation")
table(dt.train$Embarked,dt.train$Survived)
##
## Survived Died
## C 93 75
## Q 30 47
## S 219 427
round(prop.table(table(dt.train$Embarked,dt.train$Survived),1),3)
##
## Survived Died
## C 0.554 0.446
## Q 0.390 0.610
## S 0.339 0.661
There are several methods of imputation. For Age, I chose to impute the median based on the title, which I understand to be related to the age of the individual.
The title can be extracted from each passenger's name, as follows.
dt.train$Title <- gsub('(.*, )|(\\..*)', '', dt.train$Name)
table(dt.train$Title)
##
## Capt Col Don Dr Jonkheer
## 1 2 1 7 1
## Lady Major Master Miss Mlle
## 1 2 40 182 2
## Mme Mr Mrs Ms Rev
## 1 517 125 1 6
## Sir the Countess
## 1 1
options(digits=2)
with(dt.train,bystats(Age, Title,
fun=function(x)c(Mean=mean(x),Median=median(x))))
##
## c(6, 13, 6, 55, 13, 55, 6, 6) of Age by Title
##
## N Missing Mean Median
## Capt 1 0 70.0 70.0
## Col 2 0 58.0 58.0
## Don 1 0 40.0 40.0
## Dr 6 1 42.0 46.5
## Jonkheer 1 0 38.0 38.0
## Lady 1 0 48.0 48.0
## Major 2 0 48.5 48.5
## Master 36 4 4.6 3.5
## Miss 146 36 21.8 21.0
## Mlle 2 0 24.0 24.0
## Mme 1 0 24.0 24.0
## Mr 398 119 32.4 30.0
## Mrs 108 17 35.9 35.0
## Ms 1 0 28.0 28.0
## Rev 6 0 43.2 46.5
## Sir 1 0 49.0 49.0
## the Countess 1 0 33.0 33.0
## ALL 714 177 29.7 28.0
I found the following imputeMedian function on the Internet. It receives the variable with missing values (VarImpute), the variable used as a filter (VarFilter) for the median imputation, and the levels of the filter variable (VarLevels).
imputeMedian <- function(VarImpute, VarFilter, VarLevels) {
  for (i in VarLevels) {
    # Hmisc's impute() fills missing values with the median by default
    VarImpute[which(VarFilter == i)] <- impute(VarImpute[which(VarFilter == i)])
  }
  return(VarImpute)
}
unique(dt.train$Title)
## [1] "Mr" "Mrs" "Miss" "Master"
## [5] "Don" "Rev" "Dr" "Mme"
## [9] "Ms" "Major" "Lady" "Sir"
## [13] "Mlle" "Col" "Capt" "the Countess"
## [17] "Jonkheer"
## list of all titles
titles <- c("Mr","Mrs","Miss","Master","Don","Rev",
"Dr","Mme","Ms","Major","Lady","Sir",
"Mlle","Col","Capt","the Countess","Jonkheer","Dona")
dt.train$Age[which(dt.train$Title=="Dr")]
## [1] 44 54 23 32 50 NA 49
dt.train$Age <- imputeMedian(dt.train$Age,dt.train$Title,titles)
dt.train$Age[which(dt.train$Title=="Dr")] #Checking imputation
## [1] 44 54 23 32 50 46 49
h<-ggplot(dt.train,aes(x=Age, fill=Survived, color=Survived)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Titanic Training Data: Age by Variable Survived")
h1<-h+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
q<-ggplot(subset(dt.train, dt.train$Sex=="Female"),aes(x=Age, fill=Survived, color=Survived)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Age of Female")
q1<-q+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
q2<-q1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
p<-ggplot(subset(dt.train, dt.train$Sex=="Male"),aes(x=Age, fill=Survived, color=Survived)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Age of Male")
p1<-p+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
p2<-p1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
grid.arrange(h2, ncol=1, nrow =1)
grid.arrange(q2, p2, ncol=1, nrow =2)
Evaluating the distribution of survivors per age by gender, apparently age was not important if the individual was female. For males, the highest proportion of survivors occurred for passengers under the age of 15.
The following histogram presents the distribution of age per class.
q<-ggplot(dt.train, aes(x=Age, fill=Pclass)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Titanic Training Data: Age by Class")
q1<-q+scale_fill_manual(name="Class",values=c("blue","green", "red"))
q2<-q1+scale_color_manual(values=c("blue","green", "red"))
q2
Because the variable Title has too many categories, we are going to create a new title according to the following rules.
dt.train$NewTitle[dt.train$Title %in% c("Capt","Col","Don", "Dr", "Major","Rev")] <- 0 # Note: one of the Drs is a woman
dt.train$NewTitle[dt.train$Title %in% c("Lady", "Mme", "Mrs", "Ms", "the Countess")] <- 1
dt.train$NewTitle[dt.train$Title %in% c("Master")] <- 2
dt.train$NewTitle[dt.train$Title %in% c("Miss", "Mlle")] <- 3
dt.train$NewTitle[dt.train$Title %in% c("Mr", "Sir", "Jonkheer")] <- 4
dt.train$NewTitle <- as.factor(dt.train$NewTitle)
levels(dt.train$NewTitle) <- c("Special", "Mrs", "Master","Miss","Mr")
table(dt.train$NewTitle, dt.train$Survived)
##
## Survived Died
## Special 5 14
## Mrs 103 26
## Master 23 17
## Miss 129 55
## Mr 82 437
round(prop.table(table(dt.train$NewTitle, dt.train$Survived),1),3)
##
## Survived Died
## Special 0.26 0.74
## Mrs 0.80 0.20
## Master 0.57 0.42
## Miss 0.70 0.30
## Mr 0.16 0.84
The tables above suggest that the new title follows the "women and children first" idea: Master (boys under the age of 13), Miss, and Mrs have the highest survival rates.
With this noted, I am going to create new variables that separate children (under the age of 13 or under the age of 15, independent of gender), adult women, and adult men.
Since during a disaster the priority is women and children first, these variables separate children, women, and men.
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Master")] <- 0
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Miss") & dt.train$Age<=12] <- 0
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Miss") & dt.train$Age>12] <- 1
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Mrs")] <- 1
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Special") & dt.train$Sex=="Female"] <- 1 # e.g. a female Dr
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Special") & dt.train$Sex=="Male"] <- 2
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Mr")] <- 2
dt.train$WomanChild12_1st <- as.factor(dt.train$WomanChild12_1st)
levels(dt.train$WomanChild12_1st) <- c("Children", "Women", "Men")
table(dt.train$WomanChild12_1st, dt.train$Survived)
##
## Survived Died
## Children 42 30
## Women 214 68
## Men 86 451
round(prop.table(table(dt.train$WomanChild12_1st, dt.train$Survived),1),3)
##
## Survived Died
## Children 0.58 0.42
## Women 0.76 0.24
## Men 0.16 0.84
table(dt.train$WomanChild12_1st, dt.train$NewTitle)
##
## Special Mrs Master Miss Mr
## Children 0 0 40 32 0
## Women 1 129 0 152 0
## Men 18 0 0 0 519
round(prop.table(table(dt.train$WomanChild12_1st, dt.train$NewTitle),1),3)
##
## Special Mrs Master Miss Mr
## Children 0.000 0.000 0.556 0.444 0.000
## Women 0.004 0.457 0.000 0.539 0.000
## Men 0.034 0.000 0.000 0.000 0.966
h<-ggplot(dt.train,aes(x = WomanChild12_1st, fill = Survived,y = (..count..))) +
geom_bar() + labs(y = "Count")+
labs(title="Titanic Training Data: Women and Children 1st Survival",x="")
h1<-h+scale_fill_manual(name="Women & Children (< 13 years)\nFirst",values=c("#8dd3c7","#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Master")] <-0
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Miss") & dt.train$Age<=14] <- 0
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Miss") & dt.train$Age>14] <- 1
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Mrs")] <- 1
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Special") & dt.train$Sex=="Female"] <- 1 # e.g. a female Dr
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Special") & dt.train$Sex=="Male"] <- 2
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Mr") & dt.train$Age<=14] <- 0
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Mr") & dt.train$Age>14] <- 2
dt.train$WomanChild14_1st <- as.factor(dt.train$WomanChild14_1st)
levels(dt.train$WomanChild14_1st) <- c("Children", "Women", "Men")
table(dt.train$WomanChild14_1st, dt.train$Survived)
##
## Survived Died
## Children 46 34
## Women 210 67
## Men 86 448
round(prop.table(table(dt.train$WomanChild14_1st, dt.train$Survived),1),3)
##
## Survived Died
## Children 0.57 0.42
## Women 0.76 0.24
## Men 0.16 0.84
table(dt.train$WomanChild14_1st, dt.train$NewTitle)
##
## Special Mrs Master Miss Mr
## Children 0 0 40 37 3
## Women 1 129 0 147 0
## Men 18 0 0 0 516
round(prop.table(table(dt.train$WomanChild14_1st, dt.train$NewTitle),1),3)
##
## Special Mrs Master Miss Mr
## Children 0.000 0.000 0.500 0.462 0.038
## Women 0.004 0.466 0.000 0.531 0.000
## Men 0.034 0.000 0.000 0.000 0.966
q<-ggplot(dt.train,aes(x = WomanChild14_1st, fill = Survived,y = (..count..))) +
geom_bar() + labs(y = "Count")+
labs(title="Titanic Training Data: Survival of Women and Children First code",x="")
q1<-q+scale_fill_manual(name="Women & Children (< 15 years)\nFirst",values=c("#8dd3c7","#fb8072"))
q2<-q1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
grid.arrange(q2 ,ncol=1, nrow =1)
p<-ggplot(dt.train, aes(x=Age, fill=WomanChild12_1st)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Titanic Training Data: Survival of Women and Children First code")
p1<-p+scale_fill_manual(name="Women & Children (< 13 years)\nFirst",values=c("green","blue", "pink"))
p2<-p1+scale_color_manual(values=c("green","blue", "pink"))
q<-ggplot(dt.train, aes(x=Age, fill=WomanChild14_1st)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Titanic Training Data: Survival of Women and Children First code")
q1<-q+scale_fill_manual(name="Women & Children (< 15 years)\nFirst",values=c("green","blue", "pink"))
q2<-q1+scale_color_manual(values=c("green","blue", "pink"))
grid.arrange(p2,q2 ,ncol=1, nrow =2)
For the training dataset, there does not seem to be a difference between setting the children's age cutoff at 12 or 14 years old, but I will check whether it makes any difference in the models.
We also observe that the proportion of survivors is highest for adult women, followed by children and, last, men.
As we believe that a passenger travelling with family had a better chance of survival, we are going to evaluate whether the size of the family matters. For that, I will create a variable that counts the number of family members on the Titanic (combining the numbers of children, siblings, parents, and spouses).
dt.train$FamilySize <- dt.train$SibSp + dt.train$Parch + 1 # Passenger + Siblings/Spouse + Parents/Children aboard
boxplot(Age ~ FamilySize, data =dt.train, xlab="Family Size on the Ship",
ylab="Age (years)", main="Titanic Training Data")
q <- ggplot(dt.train, aes(x=FamilySize, y=Age)) + geom_jitter(aes(colour = Survived),width = 0.25)
q1 <- q + xlab("Family Size")
q2 <- q1 + ylab("Age (years)")
q2
From the previous graphs, we can see that people with larger families were younger and had a higher chance of dying.
It seems that passengers travelling alone or with 5 or more family members on the ship were more likely to die, while an individual in a family of 2, 3, or 4 was more likely to survive. Because of that, I am going to categorize the family size, as follows.
dt.train$Fsize[dt.train$FamilySize == 1] <- 1
dt.train$Fsize[dt.train$FamilySize == 2] <- 2
dt.train$Fsize[dt.train$FamilySize == 3] <- 3
dt.train$Fsize[dt.train$FamilySize == 4] <- 4
dt.train$Fsize[dt.train$FamilySize >= 5] <- 5
dt.train$Fsize <- as.factor(dt.train$Fsize)
levels(dt.train$Fsize) <- c("1", "2", "3","4","5+")
table(dt.train$Fsize, dt.train$Survived)
##
## Survived Died
## 1 163 374
## 2 89 72
## 3 59 43
## 4 21 8
## 5+ 10 52
round(prop.table(table(dt.train$Fsize, dt.train$Survived),1),3)
##
## Survived Died
## 1 0.30 0.70
## 2 0.55 0.45
## 3 0.58 0.42
## 4 0.72 0.28
## 5+ 0.16 0.84
with(dt.train,table(Fsize, Sex))
## Sex
## Fsize Female Male
## 1 126 411
## 2 87 74
## 3 49 53
## 4 19 10
## 5+ 33 29
round(prop.table(table(dt.train$Fsize, dt.train$Sex),1),3)
##
## Female Male
## 1 0.23 0.76
## 2 0.54 0.46
## 3 0.48 0.52
## 4 0.66 0.34
## 5+ 0.53 0.47
h<-ggplot(dt.train,aes(x = Fsize, fill = Survived,y = (..count..))) +
geom_bar() + labs(y = "Count")+
labs(title="Titanic Training Data: Survived by Family Size on the Ship")
h1<-h+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
grid.arrange(h2, ncol=1, nrow =1)
q <- ggplot(dt.train, aes(x=Fsize, y=Age)) + geom_jitter(aes(colour = Survived),width = 0.25)
q1 <- q+ xlab("Family Size")
q2 <- q1 + ylab("Age (years)")
grid.arrange(q2, ncol=1, nrow =1)
Just out of curiosity, I created a variable that estimates the family size within the training or test dataset. I did that so my models could adjust to the size of the family present in the dataset being evaluated.
First, I created the FamilyID by pasting the family size aboard the Titanic onto the passenger's surname.
dt.train$FamilyName <- gsub(",.*$", "", dt.train$Name)
dt.train$FamilyID <- paste(as.character(dt.train$FamilySize), dt.train$FamilyName, sep="")
With the FamilyID, we can see that even though the Sage family reportedly had 11 members aboard the Titanic, we have information on only 7 of them, all of whom died. Maybe the other 4 members are in the test dataset.
The following variable is meant to give a unique family identification for each passenger. For this analysis, we assume that all members of a family have the same number of family members on the Titanic, the same surname, the same port of embarkation, and the same ticket number.
Families with different ticket numbers or who embarked at different ports won't be classified as one family.
dt.train$FamilyID_Embk_Ticket <- paste(dt.train$FamilyID,dt.train$Embarked, as.character(dt.train$Ticket), sep="_")
dt.train$FamilyID_dataSet <- match(dt.train$FamilyID_Embk_Ticket, unique(dt.train$FamilyID_Embk_Ticket))
dt.train$FamilySize_dataSet <- ave(dt.train$FamilyID_dataSet,dt.train$FamilyID_dataSet, FUN =length)
summary(dt.train$FamilySize_dataSet)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 1.0 1.6 2.0 7.0
table(dt.train$FamilySize_dataSet,dt.train$FamilySize)
##
## 1 2 3 4 5 6 7 8 11
## 1 533 81 33 4 1 1 1 0 0
## 2 4 80 42 8 2 0 0 0 0
## 3 0 0 27 9 0 0 0 0 0
## 4 0 0 0 8 12 4 4 0 0
## 5 0 0 0 0 0 5 0 0 0
## 6 0 0 0 0 0 12 0 6 0
## 7 0 0 0 0 0 0 7 0 7
plot(dt.train$FamilySize_dataSet,dt.train$FamilySize, xlab="Family Size in the dataset",
ylab="Family Size on the Ship",main= "Titanic Training dataset")
As we were expecting, the family-size variable computed within the training dataset behaves well, taking values equal to or smaller than the family size aboard the Titanic.
with(dt.train,bystats(Fare, Pclass,
fun=function(x)c(Mean=mean(x),Median=median(x))))
##
## c(2, 13, 2, 55, 13, 55, 2, 2) of Fare by Pclass
##
## N Mean Median
## 1st Class 216 84 60.3
## 2nd Class 184 21 14.2
## 3rd Class 491 14 8.1
## ALL 891 32 14.5
q<-ggplot(dt.train, aes(x=Fare, fill=Pclass)) +
geom_histogram(position="identity", alpha=0.5,bins=50) +
labs(title="Titanic Training Data: Fare by Class")
q1<-q+scale_fill_manual(name="Class",values=c("green","blue", "red"))
q2<-q1+scale_color_manual(values=c("green","blue", "red"))
grid.arrange(q2, ncol=1, nrow =1)
Checking the ticket price (fare) by class, the median fare apparently increases with the class.
with(dt.train,bystats(Fare, FamilySize,
fun=function(x)c(Mean=mean(x),Median=median(x))))
##
## c(2, 13, 2, 55, 13, 55, 2, 2) of Fare by FamilySize
##
## N Mean Median
## 1 537 21 8.1
## 2 161 50 26.0
## 3 102 40 24.1
## 4 29 55 27.8
## 5 15 58 25.5
## 6 22 74 29.1
## 7 12 29 31.3
## 8 6 47 46.9
## 11 7 70 69.5
## ALL 891 32 14.5
with(dt.train, {
boxplot(Fare ~ FamilySize, xlab="Family Size on the Titanic",
ylab="Fare", main="Titanic Training Data", col=2:10)
})
with(dt.train,bystats(Fare, Fsize,
fun=function(x)c(Mean=mean(x),Median=median(x))))
##
## c(10, 13, 10, 55, 13, 55, 10, 10) of Fare by Fsize
##
## N Mean Median
## 1 537 21 8.1
## 2 161 50 26.0
## 3 102 40 24.1
## 4 29 55 27.8
## 5+ 62 58 31.4
## ALL 891 32 14.5
with(dt.train, {
boxplot(Fare ~ Fsize, xlab="Family Size on the Titanic",
ylab="Fare", main="Titanic Training Data", col=2:10)
})
From the boxplots above, the fare tends to differ only when the passenger travelled alone, which shows the lowest median.
To submit the model predictions to the Kaggle competition, I chose the ones with the highest accuracy in the training dataset.
Table 1: Accuracy of the fitted models on the training dataset.
| Model | Logistic | Decision Tree | Random Forest |
|---|---|---|---|
| 1 | 0.787 | 0.787 | 0.787 |
| 2 | 0.787 | 0.792 | 0.798 |
| 3 | 0.796 | 0.820 | 0.818 |
| 4 | 0.790 | 0.835 | 0.857 |
| 5 | 0.793 | 0.835 | 0.850 |
| 6 | 0.804 | 0.835 | 0.869 |
| 7 | 0.796 | 0.840 | 0.848 |
| 8 | 0.806 | 0.840 | 0.864 |
| 9 | 0.826 | 0.841 | 0.850 |
| 10 | 0.819 | 0.841 | 0.868 |
| 11 | 0.792 | 0.832 | 0.846 |
| 12 | 0.810 | 0.835 | 0.864 |
| 13 | 0.833 | 0.834 | 0.852 |
| 14 | 0.831 | 0.834 | 0.861 |
| 15 | 0.834 | 0.835 | 0.844 |
| 16 | 0.827 | 0.835 | 0.860 |
| 17 | 0.832 | 0.834 | 0.844 |
| 18 | 0.832 | 0.834 | 0.834 |
| 19 | 0.833 | 0.835 | 0.834 |
| 20 | 0.833 | | |
From the table above, I selected the following models:

* Model 9: Decision Tree (Survived ~ Sex + Age + Pclass + Fsize): accuracy on the training dataset = 0.841, Kaggle score = 0.79426
* Model 6: Random Forest (Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked): accuracy on the training dataset = 0.869, Kaggle score = 0.79426
* Model 19: Logistic (Survived ~ Pclass + Fsize + WomanChild12_1st): accuracy on the training dataset = 0.833, Kaggle score = 0.78947
* Model 13: Logistic (Survived ~ Sex + Age + Pclass + Fsize + NewTitle): accuracy on the training dataset = 0.833, Kaggle score = 0.78469
* Model 20: Logistic with stepwise selection (Survived ~ Age + Pclass + Fsize + FamilySize_dataSet + WomanChild12_1st): accuracy on the training dataset = 0.833, Kaggle score = 0.78469
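To turn any of these fitted models into a Kaggle submission, the predictions must be written out in the two-column format the competition expects (PassengerId, Survived). Below is a minimal sketch of that step; the helper name makeSubmission is mine, and it assumes dt.test has already received the same engineered variables used in the model formula:

```r
# Hypothetical helper: write a Kaggle submission file from a fitted model.
# `fit.final` is any of the fitted models above; `dt.test` must carry the
# same engineered variables as dt.train.
makeSubmission <- function(fit.final, dt.test, fileName) {
  pred <- predict(fit.final, newdata = dt.test, type = "class")
  # Kaggle expects Survived coded as 1/0, not the factor labels
  submission <- data.frame(PassengerId = dt.test$PassengerId,
                           Survived = ifelse(pred == "Survived", 1, 0))
  write.csv(submission, file = fileName, row.names = FALSE)
  invisible(submission)
}
```

For example, `makeSubmission(fit6.rf, dt.test, "model6_rf.csv")` would produce the file scored above for Model 6.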
I analysed the Titanic data to predict who survived and who died based on selected variables. The best models used variables such as sex, age, class, family size, in-dataset family size, title, and the "women and children first" variable.
I am comfortable saying that gender, age, and class were the major factors for survival or death in the Titanic tragedy, although other factors, such as the title, family size, and port of embarkation, also played a role.
The best models, according to the Kaggle score, are models 9 (decision tree) and 6 (random forest), which landed me in the top 28%. According to the training dataset, the accuracy of the random forest is higher than that of the decision tree.
For future work, it would be interesting to evaluate the distance between each passenger's cabin and the lifeboats. The variable could be created by using the ticket number to identify the cabin position in the ship and computing a vector distance. This is interesting because, in a moment of desperation, people closer to the lifeboats may have had a higher chance of survival. From the figures at the beginning of this work, we see that the first and second classes were closer to the lifeboats.
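A first step toward such a variable is already available in the data: the leading letter of the Cabin value identifies the deck, a rough proxy for vertical distance to the boat deck. A minimal sketch (the extractDeck name is mine; note that Cabin is missing for most passengers, especially in third class):

```r
# Sketch: extract the deck from the cabin number as a rough proxy for
# distance to the boat deck. NA cabins (most of the third class) stay NA.
extractDeck <- function(cabin) {
  factor(substr(cabin, 1, 1))  # e.g. "C85" -> "C"
}
# dt.train$Deck <- extractDeck(dt.train$Cabin)
# table(extractDeck(dt.train$Cabin), dt.train$Survived)
```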
set.seed(12345)
fit1.log <- glm(Survived ~ Sex , family = binomial(link='logit'), data = dt.train)
summary(fit1.log)
##
## Call:
## glm(formula = Survived ~ Sex, family = binomial(link = "logit"),
## data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.826 -0.772 0.647 0.647 1.646
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.057 0.129 -8.19 2.6e-16 ***
## SexMale 2.514 0.167 15.04 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 917.8 on 889 degrees of freedom
## AIC: 921.8
##
## Number of Fisher Scoring iterations: 4
dt.train$pred.fit1.log <- predict.glm(fit1.log, newdata = dt.train, type = "response")
# glm() models the probability of the second level of Survived ("Died"),
# so predicted probabilities above 0.5 are classified as "Died"
dt.train$pred.fit1.log <- ifelse(dt.train$pred.fit1.log > 0.5, 1, 0)
dt.train$pred.fit1.log <- factor(dt.train$pred.fit1.log, levels = c(0, 1), labels = c("Survived", "Died"))
confusionMatrix(dt.train$pred.fit1.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 233 81
## Died 109 468
##
## Accuracy : 0.787
## 95% CI : (0.758, 0.813)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.542
## Mcnemar's Test P-Value : 0.0501
##
## Sensitivity : 0.681
## Specificity : 0.852
## Pos Pred Value : 0.742
## Neg Pred Value : 0.811
## Prevalence : 0.384
## Detection Rate : 0.262
## Detection Prevalence : 0.352
## Balanced Accuracy : 0.767
##
## 'Positive' Class : Survived
##
fit1.dt <- rpart(Survived ~ Sex, data=dt.train, method="class")
fancyRpartPlot(fit1.dt)
dt.train$pred.fit1.dt <- predict(fit1.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit1.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 233 81
## Died 109 468
##
## Accuracy : 0.787
## 95% CI : (0.758, 0.813)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.542
## Mcnemar's Test P-Value : 0.0501
##
## Sensitivity : 0.681
## Specificity : 0.852
## Pos Pred Value : 0.742
## Neg Pred Value : 0.811
## Prevalence : 0.384
## Detection Rate : 0.262
## Detection Prevalence : 0.352
## Balanced Accuracy : 0.767
##
## 'Positive' Class : Survived
##
fit1.rf <- randomForest(Survived ~ Sex,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit1.rf <- predict(fit1.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit1.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 233 81
## Died 109 468
##
## Accuracy : 0.787
## 95% CI : (0.758, 0.813)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.542
## Mcnemar's Test P-Value : 0.0501
##
## Sensitivity : 0.681
## Specificity : 0.852
## Pos Pred Value : 0.742
## Neg Pred Value : 0.811
## Prevalence : 0.384
## Detection Rate : 0.262
## Detection Prevalence : 0.352
## Balanced Accuracy : 0.767
##
## 'Positive' Class : Survived
##
fit2.log <- glm(Survived ~ Sex + Age , family = binomial(link='logit'), data = dt.train)
summary(fit2.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age, family = binomial(link = "logit"),
## data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.933 -0.773 0.637 0.654 1.703
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.19213 0.21719 -5.49 4e-08 ***
## SexMale 2.50177 0.16769 14.92 <2e-16 ***
## Age 0.00489 0.00626 0.78 0.43
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 917.19 on 888 degrees of freedom
## AIC: 923.2
##
## Number of Fisher Scoring iterations: 4
dt.train$pred.fit2.log <- predict.glm(fit2.log, newdata = dt.train, type = "response")
dt.train$pred.fit2.log <- ifelse(dt.train$pred.fit2.log > 0.5,1,0)
dt.train$pred.fit2.log <- factor(dt.train$pred.fit2.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit2.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 233 81
## Died 109 468
##
## Accuracy : 0.787
## 95% CI : (0.758, 0.813)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.542
## Mcnemar's Test P-Value : 0.0501
##
## Sensitivity : 0.681
## Specificity : 0.852
## Pos Pred Value : 0.742
## Neg Pred Value : 0.811
## Prevalence : 0.384
## Detection Rate : 0.262
## Detection Prevalence : 0.352
## Balanced Accuracy : 0.767
##
## 'Positive' Class : Survived
##
fit2.dt <- rpart(Survived ~ Sex + Age, data=dt.train, method="class")
fancyRpartPlot(fit2.dt)
dt.train$pred.fit2.dt <- predict(fit2.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit2.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 256 99
## Died 86 450
##
## Accuracy : 0.792
## 95% CI : (0.764, 0.819)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.564
## Mcnemar's Test P-Value : 0.378
##
## Sensitivity : 0.749
## Specificity : 0.820
## Pos Pred Value : 0.721
## Neg Pred Value : 0.840
## Prevalence : 0.384
## Detection Rate : 0.287
## Detection Prevalence : 0.398
## Balanced Accuracy : 0.784
##
## 'Positive' Class : Survived
##
fit2.rf <- randomForest(Survived ~ Sex + Age,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit2.rf <- predict(fit2.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit2.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 245 83
## Died 97 466
##
## Accuracy : 0.798
## 95% CI : (0.77, 0.824)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.57
## Mcnemar's Test P-Value : 0.333
##
## Sensitivity : 0.716
## Specificity : 0.849
## Pos Pred Value : 0.747
## Neg Pred Value : 0.828
## Prevalence : 0.384
## Detection Rate : 0.275
## Detection Prevalence : 0.368
## Balanced Accuracy : 0.783
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit2.rf)
fit3.log <- glm(Survived ~ Sex + Age + Pclass, family = binomial(link='logit'), data = dt.train)
summary(fit3.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass, family = binomial(link = "logit"),
## data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.452 -0.632 0.411 0.661 2.669
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.60274 0.36306 -9.92 < 2e-16 ***
## SexMale 2.58635 0.18665 13.86 < 2e-16 ***
## Age 0.03513 0.00734 4.78 1.7e-06 ***
## Pclass2nd Class 1.14469 0.25786 4.44 9.0e-06 ***
## Pclass3rd Class 2.39139 0.24501 9.76 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 802.66 on 886 degrees of freedom
## AIC: 812.7
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit3.log <- predict.glm(fit3.log, newdata = dt.train, type = "response")
dt.train$pred.fit3.log <- ifelse(dt.train$pred.fit3.log > 0.5,1,0)
dt.train$pred.fit3.log <- factor(dt.train$pred.fit3.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit3.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 236 76
## Died 106 473
##
## Accuracy : 0.796
## 95% CI : (0.768, 0.822)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.561
## Mcnemar's Test P-Value : 0.0316
##
## Sensitivity : 0.690
## Specificity : 0.862
## Pos Pred Value : 0.756
## Neg Pred Value : 0.817
## Prevalence : 0.384
## Detection Rate : 0.265
## Detection Prevalence : 0.350
## Balanced Accuracy : 0.776
##
## 'Positive' Class : Survived
##
fit3.dt <- rpart(Survived ~ Sex + Age + Pclass, data=dt.train, method="class")
fancyRpartPlot(fit3.dt)
dt.train$pred.fit3.dt <- predict(fit3.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit3.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 244 62
## Died 98 487
##
## Accuracy : 0.82
## 95% CI : (0.794, 0.845)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.613
## Mcnemar's Test P-Value : 0.00566
##
## Sensitivity : 0.713
## Specificity : 0.887
## Pos Pred Value : 0.797
## Neg Pred Value : 0.832
## Prevalence : 0.384
## Detection Rate : 0.274
## Detection Prevalence : 0.343
## Balanced Accuracy : 0.800
##
## 'Positive' Class : Survived
##
fit3.rf <- randomForest(Survived ~ Sex + Age + Pclass,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit3.rf <- predict(fit3.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit3.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 239 55
## Died 103 494
##
## Accuracy : 0.823
## 95% CI : (0.796, 0.847)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.615
## Mcnemar's Test P-Value : 0.000185
##
## Sensitivity : 0.699
## Specificity : 0.900
## Pos Pred Value : 0.813
## Neg Pred Value : 0.827
## Prevalence : 0.384
## Detection Rate : 0.268
## Detection Prevalence : 0.330
## Balanced Accuracy : 0.799
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit3.rf)
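One caveat worth keeping in mind: all the accuracies reported in this section are computed on the same data the models were fit on, which flatters the tree and random-forest models in particular. As a sketch of how to estimate out-of-sample accuracy instead, caret (loaded earlier) supports k-fold cross-validation; the object names here are illustrative.

```r
# Sketch: 10-fold cross-validated accuracy for the Model 3 decision tree,
# as an out-of-sample alternative to the training-set confusion matrices.
ctrl <- trainControl(method = "cv", number = 10)
cv.fit3 <- train(Survived ~ Sex + Age + Pclass, data = dt.train,
                 method = "rpart", trControl = ctrl)
cv.fit3$results  # accuracy per complexity parameter, averaged over folds
```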
Model 4: Survived ~ Sex + Age + Pclass + SibSp — logistic regression (accuracy: 0.79)
fit4.log <- glm(Survived ~ Sex + Age + Pclass + SibSp, family = binomial(link='logit'), data = dt.train)
summary(fit4.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + SibSp, family = binomial(link = "logit"),
## data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.485 -0.622 0.413 0.597 2.722
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.14231 0.40050 -10.34 < 2e-16 ***
## SexMale 2.71767 0.19423 13.99 < 2e-16 ***
## Age 0.04290 0.00783 5.48 4.2e-08 ***
## Pclass2nd Class 1.22563 0.26248 4.67 3.0e-06 ***
## Pclass3rd Class 2.42622 0.24720 9.81 < 2e-16 ***
## SibSp 0.37558 0.10354 3.63 0.00029 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 786.49 on 885 degrees of freedom
## AIC: 798.5
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit4.log <- predict.glm(fit4.log, newdata = dt.train, type = "response")
dt.train$pred.fit4.log <- ifelse(dt.train$pred.fit4.log > 0.5,1,0)
dt.train$pred.fit4.log <- factor(dt.train$pred.fit4.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit4.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 241 86
## Died 101 463
##
## Accuracy : 0.79
## 95% CI : (0.762, 0.816)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.553
## Mcnemar's Test P-Value : 0.306
##
## Sensitivity : 0.705
## Specificity : 0.843
## Pos Pred Value : 0.737
## Neg Pred Value : 0.821
## Prevalence : 0.384
## Detection Rate : 0.270
## Detection Prevalence : 0.367
## Balanced Accuracy : 0.774
##
## 'Positive' Class : Survived
##
fit4.dt <- rpart(Survived ~ Sex + Age + Pclass + SibSp, data=dt.train, method="class")
fancyRpartPlot(fit4.dt)
dt.train$pred.fit4.dt <- predict(fit4.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit4.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 56
## Died 91 493
##
## Accuracy : 0.835
## 95% CI : (0.809, 0.859)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.644
## Mcnemar's Test P-Value : 0.00504
##
## Sensitivity : 0.734
## Specificity : 0.898
## Pos Pred Value : 0.818
## Neg Pred Value : 0.844
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.345
## Balanced Accuracy : 0.816
##
## 'Positive' Class : Survived
##
fit4.rf <- randomForest(Survived ~ Sex + Age + Pclass + SibSp,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit4.rf <- predict(fit4.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit4.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 253 40
## Died 89 509
##
## Accuracy : 0.855
## 95% CI : (0.83, 0.878)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.685
## Mcnemar's Test P-Value : 2.38e-05
##
## Sensitivity : 0.740
## Specificity : 0.927
## Pos Pred Value : 0.863
## Neg Pred Value : 0.851
## Prevalence : 0.384
## Detection Rate : 0.284
## Detection Prevalence : 0.329
## Balanced Accuracy : 0.833
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit4.rf)
fit5.log <- glm(Survived ~ Sex + Age + Pclass + SibSp + Parch, family = binomial(link='logit'), data = dt.train)
summary(fit5.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + SibSp + Parch,
## family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.464 -0.617 0.415 0.601 2.691
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.18334 0.40573 -10.31 < 2e-16 ***
## SexMale 2.74277 0.19857 13.81 < 2e-16 ***
## Age 0.04310 0.00784 5.50 3.9e-08 ***
## Pclass2nd Class 1.22574 0.26245 4.67 3.0e-06 ***
## Pclass3rd Class 2.42559 0.24703 9.82 < 2e-16 ***
## SibSp 0.35442 0.10811 3.28 0.001 **
## Parch 0.07396 0.11512 0.64 0.521
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 786.08 on 884 degrees of freedom
## AIC: 800.1
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit5.log <- predict.glm(fit5.log, newdata = dt.train, type = "response")
dt.train$pred.fit5.log <- ifelse(dt.train$pred.fit5.log > 0.5,1,0)
dt.train$pred.fit5.log <- factor(dt.train$pred.fit5.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit5.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 242 84
## Died 100 465
##
## Accuracy : 0.793
## 95% CI : (0.765, 0.82)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.56
## Mcnemar's Test P-Value : 0.269
##
## Sensitivity : 0.708
## Specificity : 0.847
## Pos Pred Value : 0.742
## Neg Pred Value : 0.823
## Prevalence : 0.384
## Detection Rate : 0.272
## Detection Prevalence : 0.366
## Balanced Accuracy : 0.777
##
## 'Positive' Class : Survived
##
fit5.dt <- rpart(Survived ~ Sex + Age + Pclass + SibSp + Parch, data=dt.train, method="class")
fancyRpartPlot(fit5.dt)
dt.train$pred.fit5.dt <- predict(fit5.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit5.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 56
## Died 91 493
##
## Accuracy : 0.835
## 95% CI : (0.809, 0.859)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.644
## Mcnemar's Test P-Value : 0.00504
##
## Sensitivity : 0.734
## Specificity : 0.898
## Pos Pred Value : 0.818
## Neg Pred Value : 0.844
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.345
## Balanced Accuracy : 0.816
##
## 'Positive' Class : Survived
##
fit5.rf <- randomForest(Survived ~ Sex + Age + Pclass + SibSp + Parch,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit5.rf <- predict(fit5.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit5.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 255 41
## Died 87 508
##
## Accuracy : 0.856
## 95% CI : (0.832, 0.879)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.688
## Mcnemar's Test P-Value : 6.97e-05
##
## Sensitivity : 0.746
## Specificity : 0.925
## Pos Pred Value : 0.861
## Neg Pred Value : 0.854
## Prevalence : 0.384
## Detection Rate : 0.286
## Detection Prevalence : 0.332
## Balanced Accuracy : 0.835
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit5.rf)
fit6.log <- glm(Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked, family = binomial(link='logit'), data = dt.train)
summary(fit6.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + SibSp + Parch +
## Embarked, family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.478 -0.624 0.411 0.601 2.611
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.40437 0.43072 -10.23 < 2e-16 ***
## SexMale 2.71165 0.20116 13.48 < 2e-16 ***
## Age 0.04199 0.00786 5.34 9.1e-08 ***
## Pclass2nd Class 1.08184 0.27060 4.00 6.4e-05 ***
## Pclass3rd Class 2.35279 0.25590 9.19 < 2e-16 ***
## SibSp 0.33133 0.10835 3.06 0.0022 **
## Parch 0.07130 0.11666 0.61 0.5411
## EmbarkedQ 0.17881 0.38786 0.46 0.6448
## EmbarkedS 0.47233 0.23583 2.00 0.0452 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 781.74 on 882 degrees of freedom
## AIC: 799.7
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit6.log <- predict.glm(fit6.log, newdata = dt.train, type = "response")
dt.train$pred.fit6.log <- ifelse(dt.train$pred.fit6.log > 0.5,1,0)
dt.train$pred.fit6.log <- factor(dt.train$pred.fit6.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit6.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 242 75
## Died 100 474
##
## Accuracy : 0.804
## 95% CI : (0.776, 0.829)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.579
## Mcnemar's Test P-Value : 0.0696
##
## Sensitivity : 0.708
## Specificity : 0.863
## Pos Pred Value : 0.763
## Neg Pred Value : 0.826
## Prevalence : 0.384
## Detection Rate : 0.272
## Detection Prevalence : 0.356
## Balanced Accuracy : 0.785
##
## 'Positive' Class : Survived
##
fit6.dt <- rpart(Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked, data=dt.train, method="class")
fancyRpartPlot(fit6.dt)
dt.train$pred.fit6.dt <- predict(fit6.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit6.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 222 27
## Died 120 522
##
## Accuracy : 0.835
## 95% CI : (0.809, 0.859)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.632
## Mcnemar's Test P-Value : 3.25e-14
##
## Sensitivity : 0.649
## Specificity : 0.951
## Pos Pred Value : 0.892
## Neg Pred Value : 0.813
## Prevalence : 0.384
## Detection Rate : 0.249
## Detection Prevalence : 0.279
## Balanced Accuracy : 0.800
##
## 'Positive' Class : Survived
##
fit6.rf <- randomForest(Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit6.rf <- predict(fit6.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit6.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 25
## Died 91 524
##
## Accuracy : 0.87
## 95% CI : (0.846, 0.891)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.714
## Mcnemar's Test P-Value : 1.59e-09
##
## Sensitivity : 0.734
## Specificity : 0.954
## Pos Pred Value : 0.909
## Neg Pred Value : 0.852
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.310
## Balanced Accuracy : 0.844
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit6.rf)
fit7.log <- glm(Survived ~ Sex + Age + Pclass + FamilySize, family = binomial(link='logit'), data = dt.train)
summary(fit7.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + FamilySize, family = binomial(link = "logit"),
## data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.437 -0.619 0.423 0.610 2.618
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.38689 0.43452 -10.10 < 2e-16 ***
## SexMale 2.76132 0.19773 13.97 < 2e-16 ***
## Age 0.04216 0.00782 5.39 7.0e-08 ***
## Pclass2nd Class 1.21007 0.26141 4.63 3.7e-06 ***
## Pclass3rd Class 2.41812 0.24652 9.81 < 2e-16 ***
## FamilySize 0.22745 0.06440 3.53 0.00041 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 788.52 on 885 degrees of freedom
## AIC: 800.5
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit7.log <- predict.glm(fit7.log, newdata = dt.train, type = "response")
dt.train$pred.fit7.log <- ifelse(dt.train$pred.fit7.log > 0.5,1,0)
dt.train$pred.fit7.log <- factor(dt.train$pred.fit7.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit7.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 243 83
## Died 99 466
##
## Accuracy : 0.796
## 95% CI : (0.768, 0.822)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.564
## Mcnemar's Test P-Value : 0.266
##
## Sensitivity : 0.711
## Specificity : 0.849
## Pos Pred Value : 0.745
## Neg Pred Value : 0.825
## Prevalence : 0.384
## Detection Rate : 0.273
## Detection Prevalence : 0.366
## Balanced Accuracy : 0.780
##
## 'Positive' Class : Survived
##
fit7.dt <- rpart(Survived ~ Sex + Age + Pclass + FamilySize, data=dt.train, method="class")
fancyRpartPlot(fit7.dt)
dt.train$pred.fit7.dt <- predict(fit7.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit7.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 52
## Died 91 497
##
## Accuracy : 0.84
## 95% CI : (0.814, 0.863)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.653
## Mcnemar's Test P-Value : 0.00148
##
## Sensitivity : 0.734
## Specificity : 0.905
## Pos Pred Value : 0.828
## Neg Pred Value : 0.845
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.340
## Balanced Accuracy : 0.820
##
## 'Positive' Class : Survived
##
fit7.rf <- randomForest(Survived ~ Sex + Age + Pclass + FamilySize,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit7.rf <- predict(fit7.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit7.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 254 48
## Died 88 501
##
## Accuracy : 0.847
## 95% CI : (0.822, 0.87)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.67
## Mcnemar's Test P-Value : 0.000825
##
## Sensitivity : 0.743
## Specificity : 0.913
## Pos Pred Value : 0.841
## Neg Pred Value : 0.851
## Prevalence : 0.384
## Detection Rate : 0.285
## Detection Prevalence : 0.339
## Balanced Accuracy : 0.828
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit7.rf)
fit8.log <- glm(Survived ~ Sex + Age + Pclass + FamilySize + Embarked , family = binomial(link='logit'), data = dt.train)
summary(fit8.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + FamilySize + Embarked,
## family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.476 -0.616 0.416 0.634 2.539
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.61714 0.45979 -10.04 < 2e-16 ***
## SexMale 2.73222 0.20015 13.65 < 2e-16 ***
## Age 0.04115 0.00784 5.25 1.5e-07 ***
## Pclass2nd Class 1.06297 0.26932 3.95 7.9e-05 ***
## Pclass3rd Class 2.34208 0.25510 9.18 < 2e-16 ***
## FamilySize 0.21439 0.06541 3.28 0.001 **
## EmbarkedQ 0.21421 0.38624 0.55 0.579
## EmbarkedS 0.49533 0.23512 2.11 0.035 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 783.82 on 883 degrees of freedom
## AIC: 799.8
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit8.log <- predict.glm(fit8.log, newdata = dt.train, type = "response")
dt.train$pred.fit8.log <- ifelse(dt.train$pred.fit8.log > 0.5,1,0)
dt.train$pred.fit8.log <- factor(dt.train$pred.fit8.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit8.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 241 72
## Died 101 477
##
## Accuracy : 0.806
## 95% CI : (0.778, 0.831)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.583
## Mcnemar's Test P-Value : 0.0333
##
## Sensitivity : 0.705
## Specificity : 0.869
## Pos Pred Value : 0.770
## Neg Pred Value : 0.825
## Prevalence : 0.384
## Detection Rate : 0.270
## Detection Prevalence : 0.351
## Balanced Accuracy : 0.787
##
## 'Positive' Class : Survived
##
fit8.dt <- rpart(Survived ~ Sex + Age + Pclass + FamilySize + Embarked , data=dt.train, method="class")
fancyRpartPlot(fit8.dt)
dt.train$pred.fit8.dt <- predict(fit8.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit8.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 52
## Died 91 497
##
## Accuracy : 0.84
## 95% CI : (0.814, 0.863)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.653
## Mcnemar's Test P-Value : 0.00148
##
## Sensitivity : 0.734
## Specificity : 0.905
## Pos Pred Value : 0.828
## Neg Pred Value : 0.845
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.340
## Balanced Accuracy : 0.820
##
## 'Positive' Class : Survived
##
fit8.rf <- randomForest(Survived ~ Sex + Age + Pclass + FamilySize + Embarked ,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit8.rf <- predict(fit8.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit8.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 242 20
## Died 100 529
##
## Accuracy : 0.865
## 95% CI : (0.841, 0.887)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.702
## Mcnemar's Test P-Value : 5.53e-13
##
## Sensitivity : 0.708
## Specificity : 0.964
## Pos Pred Value : 0.924
## Neg Pred Value : 0.841
## Prevalence : 0.384
## Detection Rate : 0.272
## Detection Prevalence : 0.294
## Balanced Accuracy : 0.836
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit8.rf)
fit9.log <- glm(Survived ~ Sex + Age + Pclass + Fsize, family = binomial(link='logit'), data = dt.train)
summary(fit9.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize, family = binomial(link = "logit"),
## data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.631 -0.585 0.424 0.608 2.932
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.88104 0.42283 -9.18 < 2e-16 ***
## SexMale 2.75420 0.20286 13.58 < 2e-16 ***
## Age 0.03953 0.00812 4.87 1.1e-06 ***
## Pclass2nd Class 1.29046 0.26906 4.80 1.6e-06 ***
## Pclass3rd Class 2.30435 0.25311 9.10 < 2e-16 ***
## Fsize2 -0.03094 0.24242 -0.13 0.898
## Fsize3 -0.53537 0.28320 -1.89 0.059 .
## Fsize4 -0.48232 0.53810 -0.90 0.370
## Fsize5+ 2.13354 0.44171 4.83 1.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 763.82 on 882 degrees of freedom
## AIC: 781.8
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit9.log <- predict.glm(fit9.log, newdata = dt.train, type = "response")
dt.train$pred.fit9.log <- ifelse(dt.train$pred.fit9.log > 0.5,1,0)
dt.train$pred.fit9.log <- factor(dt.train$pred.fit9.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit9.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 64
## Died 91 485
##
## Accuracy : 0.826
## 95% CI : (0.8, 0.85)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.627
## Mcnemar's Test P-Value : 0.0368
##
## Sensitivity : 0.734
## Specificity : 0.883
## Pos Pred Value : 0.797
## Neg Pred Value : 0.842
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.354
## Balanced Accuracy : 0.809
##
## 'Positive' Class : Survived
##
fit9.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize, data=dt.train, method="class")
fancyRpartPlot(fit9.dt)
dt.train$pred.fit9.dt <- predict(fit9.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit9.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 51
## Died 91 498
##
## Accuracy : 0.841
## 95% CI : (0.815, 0.864)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.655
## Mcnemar's Test P-Value : 0.00106
##
## Sensitivity : 0.734
## Specificity : 0.907
## Pos Pred Value : 0.831
## Neg Pred Value : 0.846
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.339
## Balanced Accuracy : 0.821
##
## 'Positive' Class : Survived
##
fit9.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit9.rf <- predict(fit9.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit9.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 254 45
## Died 88 504
##
## Accuracy : 0.851
## 95% CI : (0.826, 0.873)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.677
## Mcnemar's Test P-Value : 0.000271
##
## Sensitivity : 0.743
## Specificity : 0.918
## Pos Pred Value : 0.849
## Neg Pred Value : 0.851
## Prevalence : 0.384
## Detection Rate : 0.285
## Detection Prevalence : 0.336
## Balanced Accuracy : 0.830
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit9.rf)
fit10.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + Embarked , family = binomial(link='logit'), data = dt.train)
summary(fit10.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + Embarked,
## family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.637 -0.571 0.407 0.615 2.873
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.05512 0.45159 -8.98 < 2e-16 ***
## SexMale 2.72740 0.20585 13.25 < 2e-16 ***
## Age 0.03871 0.00812 4.77 1.9e-06 ***
## Pclass2nd Class 1.17911 0.27755 4.25 2.2e-05 ***
## Pclass3rd Class 2.25785 0.26128 8.64 < 2e-16 ***
## Fsize2 -0.00467 0.24529 -0.02 0.985
## Fsize3 -0.51743 0.28421 -1.82 0.069 .
## Fsize4 -0.48744 0.54125 -0.90 0.368
## Fsize5+ 2.04438 0.44870 4.56 5.2e-06 ***
## EmbarkedQ 0.08591 0.39529 0.22 0.828
## EmbarkedS 0.35531 0.24094 1.47 0.140
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 761.31 on 880 degrees of freedom
## AIC: 783.3
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit10.log <- predict.glm(fit10.log, newdata = dt.train, type = "response")
dt.train$pred.fit10.log <- ifelse(dt.train$pred.fit10.log > 0.5,1,0)
dt.train$pred.fit10.log <- factor(dt.train$pred.fit10.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit10.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 249 68
## Died 93 481
##
## Accuracy : 0.819
## 95% CI : (0.792, 0.844)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.613
## Mcnemar's Test P-Value : 0.0586
##
## Sensitivity : 0.728
## Specificity : 0.876
## Pos Pred Value : 0.785
## Neg Pred Value : 0.838
## Prevalence : 0.384
## Detection Rate : 0.279
## Detection Prevalence : 0.356
## Balanced Accuracy : 0.802
##
## 'Positive' Class : Survived
##
fit10.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + Embarked , data=dt.train, method="class")
fancyRpartPlot(fit10.dt)
dt.train$pred.fit10.dt <- predict(fit10.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit10.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 51
## Died 91 498
##
## Accuracy : 0.841
## 95% CI : (0.815, 0.864)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.655
## Mcnemar's Test P-Value : 0.00106
##
## Sensitivity : 0.734
## Specificity : 0.907
## Pos Pred Value : 0.831
## Neg Pred Value : 0.846
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.339
## Balanced Accuracy : 0.821
##
## 'Positive' Class : Survived
##
fit10.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + Embarked ,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit10.rf <- predict(fit10.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit10.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 245 20
## Died 97 529
##
## Accuracy : 0.869
## 95% CI : (0.845, 0.89)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.71
## Mcnemar's Test P-Value : 2.12e-12
##
## Sensitivity : 0.716
## Specificity : 0.964
## Pos Pred Value : 0.925
## Neg Pred Value : 0.845
## Prevalence : 0.384
## Detection Rate : 0.275
## Detection Prevalence : 0.297
## Balanced Accuracy : 0.840
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit10.rf)
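The predict/threshold/relabel/confusionMatrix steps above are repeated verbatim for every model that follows. A small helper along these lines would remove the duplication (a sketch; evaluate_model is not part of the original analysis):

evaluate_model <- function(fit, data) {
  if (inherits(fit, "glm")) {
    p <- predict(fit, newdata = data, type = "response")
    # The glm models the probability of the second level of Survived
    # ("Died" in this coding), so p > 0.5 is labelled "Died".
    pred <- factor(ifelse(p > 0.5, 1, 0), levels = c(0, 1),
                   labels = c("Survived", "Died"))
  } else {
    # rpart and randomForest fits return class labels directly
    pred <- predict(fit, newdata = data, type = "class")
  }
  confusionMatrix(pred, data$Survived)
}
#evaluate_model(fit10.log, dt.train)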
fit11.log <- glm(Survived ~ Sex + Age + Pclass + FamilySize_dataSet, family = binomial(link='logit'), data = dt.train)
summary(fit11.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + FamilySize_dataSet,
## family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.451 -0.586 0.424 0.589 2.618
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.68388 0.45075 -10.39 < 2e-16 ***
## SexMale 2.77868 0.19746 14.07 < 2e-16 ***
## Age 0.04528 0.00803 5.64 1.7e-08 ***
## Pclass2nd Class 1.26631 0.26527 4.77 1.8e-06 ***
## Pclass3rd Class 2.42174 0.24821 9.76 < 2e-16 ***
## FamilySize_dataSet 0.39954 0.08992 4.44 8.9e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 779.01 on 885 degrees of freedom
## AIC: 791
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit11.log <- predict.glm(fit11.log, newdata = dt.train, type = "response")
dt.train$pred.fit11.log <- ifelse(dt.train$pred.fit11.log > 0.5,1,0)
dt.train$pred.fit11.log <- factor(dt.train$pred.fit11.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit11.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 239 82
## Died 103 467
##
## Accuracy : 0.792
## 95% CI : (0.764, 0.819)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.556
## Mcnemar's Test P-Value : 0.141
##
## Sensitivity : 0.699
## Specificity : 0.851
## Pos Pred Value : 0.745
## Neg Pred Value : 0.819
## Prevalence : 0.384
## Detection Rate : 0.268
## Detection Prevalence : 0.360
## Balanced Accuracy : 0.775
##
## 'Positive' Class : Survived
##
fit11.dt <- rpart(Survived ~ Sex + Age + Pclass + FamilySize_dataSet, data=dt.train, method="class")
fancyRpartPlot(fit11.dt)
dt.train$pred.fit11.dt <- predict(fit11.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit11.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 253 61
## Died 89 488
##
## Accuracy : 0.832
## 95% CI : (0.805, 0.856)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.639
## Mcnemar's Test P-Value : 0.0275
##
## Sensitivity : 0.740
## Specificity : 0.889
## Pos Pred Value : 0.806
## Neg Pred Value : 0.846
## Prevalence : 0.384
## Detection Rate : 0.284
## Detection Prevalence : 0.352
## Balanced Accuracy : 0.814
##
## 'Positive' Class : Survived
##
fit11.rf <- randomForest(Survived ~ Sex + Age + Pclass + FamilySize_dataSet,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit11.rf <- predict(fit11.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit11.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 255 48
## Died 87 501
##
## Accuracy : 0.848
## 95% CI : (0.823, 0.871)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.673
## Mcnemar's Test P-Value : 0.00107
##
## Sensitivity : 0.746
## Specificity : 0.913
## Pos Pred Value : 0.842
## Neg Pred Value : 0.852
## Prevalence : 0.384
## Detection Rate : 0.286
## Detection Prevalence : 0.340
## Balanced Accuracy : 0.829
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit11.rf)
fit12.log <- glm(Survived ~ Sex + Age + Pclass + FamilySize_dataSet + Embarked , family = binomial(link='logit'), data = dt.train)
summary(fit12.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + FamilySize_dataSet +
## Embarked, family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.486 -0.590 0.421 0.596 2.542
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.89675 0.47318 -10.35 < 2e-16 ***
## SexMale 2.75329 0.19979 13.78 < 2e-16 ***
## Age 0.04422 0.00804 5.50 3.8e-08 ***
## Pclass2nd Class 1.12520 0.27324 4.12 3.8e-05 ***
## Pclass3rd Class 2.34726 0.25646 9.15 < 2e-16 ***
## FamilySize_dataSet 0.38359 0.09130 4.20 2.7e-05 ***
## EmbarkedQ 0.22032 0.38836 0.57 0.570
## EmbarkedS 0.46687 0.23528 1.98 0.047 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 774.9 on 883 degrees of freedom
## AIC: 790.9
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit12.log <- predict.glm(fit12.log, newdata = dt.train, type = "response")
dt.train$pred.fit12.log <- ifelse(dt.train$pred.fit12.log > 0.5,1,0)
dt.train$pred.fit12.log <- factor(dt.train$pred.fit12.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit12.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 243 70
## Died 99 479
##
## Accuracy : 0.81
## 95% CI : (0.783, 0.836)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.592
## Mcnemar's Test P-Value : 0.0313
##
## Sensitivity : 0.711
## Specificity : 0.872
## Pos Pred Value : 0.776
## Neg Pred Value : 0.829
## Prevalence : 0.384
## Detection Rate : 0.273
## Detection Prevalence : 0.351
## Balanced Accuracy : 0.792
##
## 'Positive' Class : Survived
##
fit12.dt <- rpart(Survived ~ Sex + Age + Pclass + FamilySize_dataSet + Embarked , data=dt.train, method="class")
fancyRpartPlot(fit12.dt)
dt.train$pred.fit12.dt <- predict(fit12.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit12.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 221 26
## Died 121 523
##
## Accuracy : 0.835
## 95% CI : (0.809, 0.859)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.632
## Mcnemar's Test P-Value : 8.98e-15
##
## Sensitivity : 0.646
## Specificity : 0.953
## Pos Pred Value : 0.895
## Neg Pred Value : 0.812
## Prevalence : 0.384
## Detection Rate : 0.248
## Detection Prevalence : 0.277
## Balanced Accuracy : 0.799
##
## 'Positive' Class : Survived
##
fit12.rf <- randomForest(Survived ~ Sex + Age + Pclass + FamilySize_dataSet + Embarked ,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit12.rf <- predict(fit12.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit12.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 244 24
## Died 98 525
##
## Accuracy : 0.863
## 95% CI : (0.839, 0.885)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.698
## Mcnemar's Test P-Value : 3.87e-11
##
## Sensitivity : 0.713
## Specificity : 0.956
## Pos Pred Value : 0.910
## Neg Pred Value : 0.843
## Prevalence : 0.384
## Detection Rate : 0.274
## Detection Prevalence : 0.301
## Balanced Accuracy : 0.835
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit12.rf)
fit13.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + NewTitle, family = binomial(link='logit'), data = dt.train)
summary(fit13.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + NewTitle,
## family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.415 -0.522 0.400 0.535 2.697
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.84669 535.41138 -0.03 0.9779
## SexMale 14.32306 535.41150 0.03 0.9787
## Age 0.02614 0.00968 2.70 0.0069 **
## Pclass2nd Class 1.44925 0.29247 4.96 7.2e-07 ***
## Pclass3rd Class 2.38485 0.26895 8.87 < 2e-16 ***
## Fsize2 0.30835 0.26943 1.14 0.2524
## Fsize3 0.17462 0.32732 0.53 0.5937
## Fsize4 0.06919 0.58242 0.12 0.9054
## Fsize5+ 2.89729 0.46305 6.26 3.9e-10 ***
## NewTitleMrs 10.51332 535.41131 0.02 0.9843
## NewTitleMaster -3.57631 0.85033 -4.21 2.6e-05 ***
## NewTitleMiss 11.19556 535.41128 0.02 0.9833
## NewTitleMr -0.17772 0.61624 -0.29 0.7731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 717.63 on 878 degrees of freedom
## AIC: 743.6
##
## Number of Fisher Scoring iterations: 12
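Note the very large standard errors (around 535) on the SexMale, NewTitleMrs and NewTitleMiss coefficients above, together with the 12 Fisher scoring iterations. This pattern usually indicates quasi-complete separation: NewTitle encodes sex almost perfectly, so the model cannot disentangle the two effects. A quick cross-tabulation (a diagnostic sketch, not part of the original analysis) makes the overlap visible:

# Each title should occur for essentially one sex only,
# which would explain the inflated standard errors in fit13.log.
table(dt.train$Sex, dt.train$NewTitle)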
dt.train$pred.fit13.log <- predict.glm(fit13.log, newdata = dt.train, type = "response")
dt.train$pred.fit13.log <- ifelse(dt.train$pred.fit13.log > 0.5,1,0)
dt.train$pred.fit13.log <- factor(dt.train$pred.fit13.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit13.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 252 59
## Died 90 490
##
## Accuracy : 0.833
## 95% CI : (0.807, 0.857)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.64
## Mcnemar's Test P-Value : 0.014
##
## Sensitivity : 0.737
## Specificity : 0.893
## Pos Pred Value : 0.810
## Neg Pred Value : 0.845
## Prevalence : 0.384
## Detection Rate : 0.283
## Detection Prevalence : 0.349
## Balanced Accuracy : 0.815
##
## 'Positive' Class : Survived
##
fit13.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + NewTitle, data=dt.train, method="class")
fancyRpartPlot(fit13.dt)
dt.train$pred.fit13.dt <- predict(fit13.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit13.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 57
## Died 91 492
##
## Accuracy : 0.834
## 95% CI : (0.808, 0.858)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.642
## Mcnemar's Test P-Value : 0.00668
##
## Sensitivity : 0.734
## Specificity : 0.896
## Pos Pred Value : 0.815
## Neg Pred Value : 0.844
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.346
## Balanced Accuracy : 0.815
##
## 'Positive' Class : Survived
##
fit13.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + NewTitle,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit13.rf <- predict(fit13.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit13.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 253 47
## Died 89 502
##
## Accuracy : 0.847
## 95% CI : (0.822, 0.87)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.67
## Mcnemar's Test P-Value : 0.000439
##
## Sensitivity : 0.740
## Specificity : 0.914
## Pos Pred Value : 0.843
## Neg Pred Value : 0.849
## Prevalence : 0.384
## Detection Rate : 0.284
## Detection Prevalence : 0.337
## Balanced Accuracy : 0.827
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit13.rf)
fit14.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + NewTitle + Embarked, family = binomial(link='logit'), data = dt.train)
summary(fit14.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + NewTitle +
## Embarked, family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.449 -0.530 0.385 0.525 2.636
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.2108 535.4114 -0.03 0.9773
## SexMale 14.5116 535.4115 0.03 0.9784
## Age 0.0254 0.0097 2.62 0.0087 **
## Pclass2nd Class 1.3275 0.3010 4.41 1.0e-05 ***
## Pclass3rd Class 2.3445 0.2791 8.40 < 2e-16 ***
## Fsize2 0.3494 0.2729 1.28 0.2004
## Fsize3 0.1973 0.3279 0.60 0.5473
## Fsize4 0.0984 0.5861 0.17 0.8667
## Fsize5+ 2.8253 0.4690 6.02 1.7e-09 ***
## NewTitleMrs 10.6355 535.4113 0.02 0.9842
## NewTitleMaster -3.6220 0.8553 -4.23 2.3e-05 ***
## NewTitleMiss 11.3698 535.4113 0.02 0.9831
## NewTitleMr -0.2406 0.6230 -0.39 0.6994
## EmbarkedQ 0.0618 0.3994 0.15 0.8771
## EmbarkedS 0.3981 0.2508 1.59 0.1125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 714.55 on 876 degrees of freedom
## AIC: 744.5
##
## Number of Fisher Scoring iterations: 12
dt.train$pred.fit14.log <- predict.glm(fit14.log, newdata = dt.train, type = "response")
dt.train$pred.fit14.log <- ifelse(dt.train$pred.fit14.log > 0.5,1,0)
dt.train$pred.fit14.log <- factor(dt.train$pred.fit14.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit14.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 257 66
## Died 85 483
##
## Accuracy : 0.831
## 95% CI : (0.804, 0.855)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.638
## Mcnemar's Test P-Value : 0.143
##
## Sensitivity : 0.751
## Specificity : 0.880
## Pos Pred Value : 0.796
## Neg Pred Value : 0.850
## Prevalence : 0.384
## Detection Rate : 0.288
## Detection Prevalence : 0.363
## Balanced Accuracy : 0.816
##
## 'Positive' Class : Survived
##
fit14.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + NewTitle + Embarked, data=dt.train, method="class")
fancyRpartPlot(fit14.dt)
dt.train$pred.fit14.dt <- predict(fit14.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit14.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 57
## Died 91 492
##
## Accuracy : 0.834
## 95% CI : (0.808, 0.858)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.642
## Mcnemar's Test P-Value : 0.00668
##
## Sensitivity : 0.734
## Specificity : 0.896
## Pos Pred Value : 0.815
## Neg Pred Value : 0.844
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.346
## Balanced Accuracy : 0.815
##
## 'Positive' Class : Survived
##
fit14.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + NewTitle + Embarked,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit14.rf <- predict(fit14.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit14.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 244 26
## Died 98 523
##
## Accuracy : 0.861
## 95% CI : (0.836, 0.883)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.694
## Mcnemar's Test P-Value : 1.82e-10
##
## Sensitivity : 0.713
## Specificity : 0.953
## Pos Pred Value : 0.904
## Neg Pred Value : 0.842
## Prevalence : 0.384
## Detection Rate : 0.274
## Detection Prevalence : 0.303
## Balanced Accuracy : 0.833
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit14.rf)
# Model 15: Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st (logistic accuracy: 0.834)
fit15.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st , family = binomial(link='logit'), data = dt.train)
summary(fit15.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st,
## family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.397 -0.550 0.396 0.557 2.658
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.44353 0.60498 -5.69 1.3e-08 ***
## SexMale -0.43497 0.65497 -0.66 0.507
## Age 0.02044 0.00932 2.19 0.028 *
## Pclass2nd Class 1.34719 0.28247 4.77 1.8e-06 ***
## Pclass3rd Class 2.31258 0.26417 8.75 < 2e-16 ***
## Fsize2 0.15872 0.25407 0.62 0.532
## Fsize3 0.02978 0.31829 0.09 0.925
## Fsize4 -0.10135 0.57676 -0.18 0.861
## Fsize5+ 2.80083 0.47454 5.90 3.6e-09 ***
## WomanChild12_1stWomen -0.17940 0.55928 -0.32 0.748
## WomanChild12_1stMen 3.46207 0.59717 5.80 6.7e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 721.4 on 880 degrees of freedom
## AIC: 743.4
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit15.log <- predict.glm(fit15.log, newdata = dt.train, type = "response")
dt.train$pred.fit15.log <- ifelse(dt.train$pred.fit15.log > 0.5,1,0)
dt.train$pred.fit15.log <- factor(dt.train$pred.fit15.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit15.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 57
## Died 91 492
##
## Accuracy : 0.834
## 95% CI : (0.808, 0.858)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.642
## Mcnemar's Test P-Value : 0.00668
##
## Sensitivity : 0.734
## Specificity : 0.896
## Pos Pred Value : 0.815
## Neg Pred Value : 0.844
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.346
## Balanced Accuracy : 0.815
##
## 'Positive' Class : Survived
##
fit15.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st , data=dt.train, method="class")
fancyRpartPlot(fit15.dt)
dt.train$pred.fit15.dt <- predict(fit15.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit15.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 252 57
## Died 90 492
##
## Accuracy : 0.835
## 95% CI : (0.809, 0.859)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.645
## Mcnemar's Test P-Value : 0.00831
##
## Sensitivity : 0.737
## Specificity : 0.896
## Pos Pred Value : 0.816
## Neg Pred Value : 0.845
## Prevalence : 0.384
## Detection Rate : 0.283
## Detection Prevalence : 0.347
## Balanced Accuracy : 0.817
##
## 'Positive' Class : Survived
##
fit15.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st ,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit15.rf <- predict(fit15.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit15.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 252 49
## Died 90 500
##
## Accuracy : 0.844
## 95% CI : (0.818, 0.867)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.663
## Mcnemar's Test P-Value : 0.000692
##
## Sensitivity : 0.737
## Specificity : 0.911
## Pos Pred Value : 0.837
## Neg Pred Value : 0.847
## Prevalence : 0.384
## Detection Rate : 0.283
## Detection Prevalence : 0.338
## Balanced Accuracy : 0.824
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit15.rf)
fit16.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st + Embarked, family = binomial(link='logit'), data = dt.train)
summary(fit16.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st +
## Embarked, family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.421 -0.553 0.381 0.544 2.580
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.65849 0.62451 -5.86 4.7e-09 ***
## SexMale -0.43462 0.65824 -0.66 0.51
## Age 0.01923 0.00936 2.05 0.04 *
## Pclass2nd Class 1.22344 0.29170 4.19 2.7e-05 ***
## Pclass3rd Class 2.24714 0.27248 8.25 < 2e-16 ***
## Fsize2 0.19485 0.25774 0.76 0.45
## Fsize3 0.05615 0.31922 0.18 0.86
## Fsize4 -0.05728 0.57910 -0.10 0.92
## Fsize5+ 2.74206 0.47945 5.72 1.1e-08 ***
## WomanChild12_1stWomen -0.15633 0.56590 -0.28 0.78
## WomanChild12_1stMen 3.47088 0.59879 5.80 6.8e-09 ***
## EmbarkedQ 0.19032 0.40385 0.47 0.64
## EmbarkedS 0.38593 0.24888 1.55 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 718.87 on 878 degrees of freedom
## AIC: 744.9
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit16.log <- predict.glm(fit16.log, newdata = dt.train, type = "response")
dt.train$pred.fit16.log <- ifelse(dt.train$pred.fit16.log > 0.5,1,0)
dt.train$pred.fit16.log <- factor(dt.train$pred.fit16.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit16.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 254 66
## Died 88 483
##
## Accuracy : 0.827
## 95% CI : (0.801, 0.851)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.63
## Mcnemar's Test P-Value : 0.0906
##
## Sensitivity : 0.743
## Specificity : 0.880
## Pos Pred Value : 0.794
## Neg Pred Value : 0.846
## Prevalence : 0.384
## Detection Rate : 0.285
## Detection Prevalence : 0.359
## Balanced Accuracy : 0.811
##
## 'Positive' Class : Survived
##
fit16.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st + Embarked, data=dt.train, method="class")
fancyRpartPlot(fit16.dt)
dt.train$pred.fit16.dt <- predict(fit16.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit16.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 252 57
## Died 90 492
##
## Accuracy : 0.835
## 95% CI : (0.809, 0.859)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.645
## Mcnemar's Test P-Value : 0.00831
##
## Sensitivity : 0.737
## Specificity : 0.896
## Pos Pred Value : 0.816
## Neg Pred Value : 0.845
## Prevalence : 0.384
## Detection Rate : 0.283
## Detection Prevalence : 0.347
## Balanced Accuracy : 0.817
##
## 'Positive' Class : Survived
##
fit16.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st + Embarked,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit16.rf <- predict(fit16.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit16.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 246 28
## Died 96 521
##
## Accuracy : 0.861
## 95% CI : (0.836, 0.883)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.694
## Mcnemar's Test P-Value : 1.78e-09
##
## Sensitivity : 0.719
## Specificity : 0.949
## Pos Pred Value : 0.898
## Neg Pred Value : 0.844
## Prevalence : 0.384
## Detection Rate : 0.276
## Detection Prevalence : 0.308
## Balanced Accuracy : 0.834
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit16.rf)
fit17.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + WomanChild14_1st , family = binomial(link='logit'), data = dt.train)
summary(fit17.log)
##
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + WomanChild14_1st,
## family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.405 -0.549 0.394 0.570 2.667
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.43792 0.56661 -6.07 1.3e-09 ***
## SexMale -0.19491 0.61333 -0.32 0.751
## Age 0.02125 0.00936 2.27 0.023 *
## Pclass2nd Class 1.34290 0.28191 4.76 1.9e-06 ***
## Pclass3rd Class 2.31714 0.26384 8.78 < 2e-16 ***
## Fsize2 0.13801 0.25307 0.55 0.586
## Fsize3 -0.01202 0.31383 -0.04 0.969
## Fsize4 -0.13071 0.57105 -0.23 0.819
## Fsize5+ 2.73261 0.46133 5.92 3.2e-09 ***
## WomanChild14_1stWomen -0.19292 0.51626 -0.37 0.709
## WomanChild14_1stMen 3.19351 0.56586 5.64 1.7e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 724.82 on 880 degrees of freedom
## AIC: 746.8
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit17.log <- predict.glm(fit17.log, newdata = dt.train, type = "response")
dt.train$pred.fit17.log <- ifelse(dt.train$pred.fit17.log > 0.5,1,0)
dt.train$pred.fit17.log <- factor(dt.train$pred.fit17.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit17.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 250 58
## Died 92 491
##
## Accuracy : 0.832
## 95% CI : (0.805, 0.856)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.637
## Mcnemar's Test P-Value : 0.00705
##
## Sensitivity : 0.731
## Specificity : 0.894
## Pos Pred Value : 0.812
## Neg Pred Value : 0.842
## Prevalence : 0.384
## Detection Rate : 0.281
## Detection Prevalence : 0.346
## Balanced Accuracy : 0.813
##
## 'Positive' Class : Survived
##
fit17.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + WomanChild14_1st , data=dt.train, method="class")
fancyRpartPlot(fit17.dt)
dt.train$pred.fit17.dt <- predict(fit17.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit17.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 252 58
## Died 90 491
##
## Accuracy : 0.834
## 95% CI : (0.808, 0.858)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.643
## Mcnemar's Test P-Value : 0.0108
##
## Sensitivity : 0.737
## Specificity : 0.894
## Pos Pred Value : 0.813
## Neg Pred Value : 0.845
## Prevalence : 0.384
## Detection Rate : 0.283
## Detection Prevalence : 0.348
## Balanced Accuracy : 0.816
##
## 'Positive' Class : Survived
##
fit17.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + WomanChild14_1st,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit17.rf <- predict(fit17.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit17.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 252 49
## Died 90 500
##
## Accuracy : 0.844
## 95% CI : (0.818, 0.867)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.663
## Mcnemar's Test P-Value : 0.000692
##
## Sensitivity : 0.737
## Specificity : 0.911
## Pos Pred Value : 0.837
## Neg Pred Value : 0.847
## Prevalence : 0.384
## Detection Rate : 0.283
## Detection Prevalence : 0.338
## Balanced Accuracy : 0.824
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit17.rf)
fit18.log <- glm(Survived ~ Pclass + Fsize + WomanChild14_1st , family = binomial(link='logit'), data = dt.train)
summary(fit18.log)
##
## Call:
## glm(formula = Survived ~ Pclass + Fsize + WomanChild14_1st, family = binomial(link = "logit"),
## data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.320 -0.596 0.399 0.605 2.609
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.19613 0.46206 -6.92 4.6e-12 ***
## Pclass2nd Class 1.20572 0.27350 4.41 1.0e-05 ***
## Pclass3rd Class 2.09045 0.24112 8.67 < 2e-16 ***
## Fsize2 0.13147 0.25272 0.52 0.60
## Fsize3 0.00371 0.31273 0.01 0.99
## Fsize4 -0.17321 0.57068 -0.30 0.76
## Fsize5+ 2.69648 0.44581 6.05 1.5e-09 ***
## WomanChild14_1stWomen 0.35112 0.37603 0.93 0.35
## WomanChild14_1stMen 3.59545 0.40604 8.85 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 730.26 on 882 degrees of freedom
## AIC: 748.3
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit18.log <- predict.glm(fit18.log, newdata = dt.train, type = "response")
dt.train$pred.fit18.log <- ifelse(dt.train$pred.fit18.log > 0.5,1,0)
dt.train$pred.fit18.log <- factor(dt.train$pred.fit18.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit18.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 250 58
## Died 92 491
##
## Accuracy : 0.832
## 95% CI : (0.805, 0.856)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.637
## Mcnemar's Test P-Value : 0.00705
##
## Sensitivity : 0.731
## Specificity : 0.894
## Pos Pred Value : 0.812
## Neg Pred Value : 0.842
## Prevalence : 0.384
## Detection Rate : 0.281
## Detection Prevalence : 0.346
## Balanced Accuracy : 0.813
##
## 'Positive' Class : Survived
##
fit18.dt <- rpart(Survived ~ Pclass + Fsize + WomanChild14_1st , data=dt.train, method="class")
fancyRpartPlot(fit18.dt)
dt.train$pred.fit18.dt <- predict(fit18.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit18.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 252 58
## Died 90 491
##
## Accuracy : 0.834
## 95% CI : (0.808, 0.858)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.643
## Mcnemar's Test P-Value : 0.0108
##
## Sensitivity : 0.737
## Specificity : 0.894
## Pos Pred Value : 0.813
## Neg Pred Value : 0.845
## Prevalence : 0.384
## Detection Rate : 0.283
## Detection Prevalence : 0.348
## Balanced Accuracy : 0.816
##
## 'Positive' Class : Survived
##
fit18.rf <- randomForest(Survived ~ Pclass + Fsize + WomanChild14_1st,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit18.rf <- predict(fit18.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit18.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 250 56
## Died 92 493
##
## Accuracy : 0.834
## 95% CI : (0.808, 0.858)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.642
## Mcnemar's Test P-Value : 0.00401
##
## Sensitivity : 0.731
## Specificity : 0.898
## Pos Pred Value : 0.817
## Neg Pred Value : 0.843
## Prevalence : 0.384
## Detection Rate : 0.281
## Detection Prevalence : 0.343
## Balanced Accuracy : 0.814
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit18.rf)
fit19.log <- glm(Survived ~ Pclass + Fsize + WomanChild12_1st , family = binomial(link='logit'), data = dt.train)
summary(fit19.log)
##
## Call:
## glm(formula = Survived ~ Pclass + Fsize + WomanChild12_1st, family = binomial(link = "logit"),
## data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.325 -0.591 0.401 0.605 2.649
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.3641 0.4922 -6.84 8.2e-12 ***
## Pclass2nd Class 1.2149 0.2742 4.43 9.4e-06 ***
## Pclass3rd Class 2.0923 0.2413 8.67 < 2e-16 ***
## Fsize2 0.1508 0.2536 0.59 0.55
## Fsize3 0.0564 0.3175 0.18 0.86
## Fsize4 -0.1136 0.5764 -0.20 0.84
## Fsize5+ 2.7551 0.4579 6.02 1.8e-09 ***
## WomanChild12_1stWomen 0.4909 0.4039 1.22 0.22
## WomanChild12_1stMen 3.7539 0.4372 8.59 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 726.79 on 882 degrees of freedom
## AIC: 744.8
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit19.log <- predict.glm(fit19.log, newdata = dt.train, type = "response")
dt.train$pred.fit19.log <- ifelse(dt.train$pred.fit19.log > 0.5,1,0)
dt.train$pred.fit19.log <- factor(dt.train$pred.fit19.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit19.log, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 250 57
## Died 92 492
##
## Accuracy : 0.833
## 95% CI : (0.807, 0.857)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.64
## Mcnemar's Test P-Value : 0.00535
##
## Sensitivity : 0.731
## Specificity : 0.896
## Pos Pred Value : 0.814
## Neg Pred Value : 0.842
## Prevalence : 0.384
## Detection Rate : 0.281
## Detection Prevalence : 0.345
## Balanced Accuracy : 0.814
##
## 'Positive' Class : Survived
##
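The headline statistics in the confusion-matrix output above can be reproduced by hand from the 2x2 table; a base-R sketch using the printed counts:

```r
# Recompute the confusionMatrix() statistics from the 2x2 counts above
# (rows = predictions, columns = reference; 'Survived' is the positive class)
tp <- 250  # predicted Survived, reference Survived
fp <- 57   # predicted Survived, reference Died
fn <- 92   # predicted Died,     reference Survived
tn <- 492  # predicted Died,     reference Died

accuracy    <- (tp + tn) / (tp + fp + fn + tn)  # 0.833
sensitivity <- tp / (tp + fn)                   # 0.731
specificity <- tn / (tn + fp)                   # 0.896
ppv         <- tp / (tp + fp)                   # 0.814
```

These match the Accuracy, Sensitivity, Specificity, and Pos Pred Value lines reported by caret.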
fit19.dt <- rpart(Survived ~ Pclass + Fsize + WomanChild12_1st , data=dt.train, method="class")
fancyRpartPlot(fit19.dt)
dt.train$pred.fit19.dt <- predict(fit19.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit19.dt, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 252 57
## Died 90 492
##
## Accuracy : 0.835
## 95% CI : (0.809, 0.859)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.645
## Mcnemar's Test P-Value : 0.00831
##
## Sensitivity : 0.737
## Specificity : 0.896
## Pos Pred Value : 0.816
## Neg Pred Value : 0.845
## Prevalence : 0.384
## Detection Rate : 0.283
## Detection Prevalence : 0.347
## Balanced Accuracy : 0.817
##
## 'Positive' Class : Survived
##
fit19.rf <- randomForest(Survived ~ Pclass + Fsize + WomanChild12_1st,
data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit19.rf <- predict(fit19.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit19.rf, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 251 57
## Died 91 492
##
## Accuracy : 0.834
## 95% CI : (0.808, 0.858)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.642
## Mcnemar's Test P-Value : 0.00668
##
## Sensitivity : 0.734
## Specificity : 0.896
## Pos Pred Value : 0.815
## Neg Pred Value : 0.844
## Prevalence : 0.384
## Detection Rate : 0.282
## Detection Prevalence : 0.346
## Balanced Accuracy : 0.815
##
## 'Positive' Class : Survived
##
# Look at variable importance
varImpPlot(fit19.rf)
fit20.log_C <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked +
Title + NewTitle + WomanChild12_1st +WomanChild14_1st+
Fsize + FamilySize_dataSet , family = binomial(link='logit'), data = dt.train)
fit20.log_N <- glm(Survived ~ 1 , family = binomial(link='logit'), data = dt.train)
#Best Model: Age + Pclass + Fsize + FamilySize_dataSet + WomanChild12_1st
forwards = step(fit20.log_N,scope=list(lower=formula(fit20.log_N),upper=formula(fit20.log_C)), direction="forward")
## Start: AIC=1189
## Survived ~ 1
##
## Df Deviance AIC
## + WomanChild12_1st 2 882 888
## + WomanChild14_1st 2 887 893
## + NewTitle 4 883 893
## + Title 16 869 903
## + Sex 1 918 922
## + Pclass 2 1083 1089
## + Fsize 4 1108 1118
## + Embarked 2 1161 1167
## + Parch 1 1181 1185
## + Age 1 1181 1185
## <none> 1187 1189
## + FamilySize_dataSet 1 1185 1189
## + SibSp 1 1186 1190
##
## Step: AIC=888
## Survived ~ WomanChild12_1st
##
## Df Deviance AIC
## + Pclass 2 779 789
## + Fsize 4 810 824
## + FamilySize_dataSet 1 826 834
## + SibSp 1 845 853
## + Embarked 2 860 870
## + Parch 1 869 877
## + Age 1 880 888
## <none> 882 888
## + Sex 1 882 890
## + WomanChild14_1st 2 881 891
## + NewTitle 4 879 893
## + Title 16 866 904
##
## Step: AIC=789
## Survived ~ WomanChild12_1st + Pclass
##
## Df Deviance AIC
## + FamilySize_dataSet 1 732 744
## + Fsize 4 727 745
## + SibSp 1 748 760
## + Parch 1 767 779
## + Age 1 773 785
## + Embarked 2 771 785
## <none> 779 789
## + Sex 1 779 791
## + WomanChild14_1st 2 779 793
## + NewTitle 4 778 796
## + Title 16 770 812
##
## Step: AIC=744
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet
##
## Df Deviance AIC
## + Fsize 4 720 740
## + Age 1 727 741
## + Embarked 2 728 744
## <none> 732 744
## + SibSp 1 731 745
## + Parch 1 731 745
## + Sex 1 732 746
## + WomanChild14_1st 2 732 748
## + NewTitle 4 728 748
## + Title 16 721 765
##
## Step: AIC=740
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + Fsize
##
## Df Deviance AIC
## + Age 1 715 737
## <none> 720 740
## + Embarked 2 717 741
## + Parch 1 719 741
## + Sex 1 719 741
## + SibSp 1 719 741
## + WomanChild14_1st 2 719 743
## + NewTitle 4 717 745
## + Title 16 710 762
##
## Step: AIC=737
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + Fsize +
## Age
##
## Df Deviance AIC
## <none> 715 737
## + Parch 1 714 738
## + Sex 1 714 738
## + SibSp 1 714 738
## + Embarked 2 712 738
## + NewTitle 4 710 740
## + WomanChild14_1st 2 714 740
## + Title 16 702 756
#Best Model: Age + Pclass + Fsize + FamilySize_dataSet + WomanChild12_1st
backwards = step(fit20.log_C) # Backwards selection is the default
## Start: AIC=764
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + Title +
## NewTitle + WomanChild12_1st + WomanChild14_1st + Fsize +
## FamilySize_dataSet
##
##
## Step: AIC=764
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + Title +
## WomanChild12_1st + WomanChild14_1st + Fsize + FamilySize_dataSet
##
##
## Step: AIC=764
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + Title +
## WomanChild12_1st + WomanChild14_1st + Fsize + FamilySize_dataSet
##
## Df Deviance AIC
## - Title 16 711 745
## - WomanChild14_1st 2 699 761
## - WomanChild12_1st 2 700 762
## - SibSp 1 699 763
## - Embarked 2 701 763
## - Parch 1 699 763
## - Fsize 4 706 764
## <none> 698 764
## - Age 1 705 769
## - FamilySize_dataSet 1 706 770
## - Pclass 2 774 836
##
## Step: AIC=745
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + WomanChild12_1st +
## WomanChild14_1st + Fsize + FamilySize_dataSet
##
## Df Deviance AIC
## - WomanChild14_1st 2 711 741
## - SibSp 1 711 743
## - Embarked 2 713 743
## - Parch 1 712 744
## <none> 711 745
## - WomanChild12_1st 2 715 745
## - Fsize 4 721 747
## - Age 1 716 748
## - FamilySize_dataSet 1 718 750
## - Pclass 2 786 816
##
## Step: AIC=741
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + WomanChild12_1st +
## Fsize + FamilySize_dataSet
##
## Df Deviance AIC
## - Embarked 2 713 739
## - SibSp 1 712 740
## - Parch 1 712 740
## <none> 711 741
## - Fsize 4 721 743
## - Age 1 716 744
## - FamilySize_dataSet 1 718 746
## - Pclass 2 787 813
## - WomanChild12_1st 2 970 996
##
## Step: AIC=739
## Survived ~ Pclass + Age + SibSp + Parch + WomanChild12_1st +
## Fsize + FamilySize_dataSet
##
## Df Deviance AIC
## - SibSp 1 714 738
## - Parch 1 714 738
## <none> 713 739
## - Fsize 4 725 743
## - Age 1 719 743
## - FamilySize_dataSet 1 721 745
## - Pclass 2 799 821
## - WomanChild12_1st 2 984 1006
##
## Step: AIC=738
## Survived ~ Pclass + Age + Parch + WomanChild12_1st + Fsize +
## FamilySize_dataSet
##
## Df Deviance AIC
## - Parch 1 715 737
## <none> 714 738
## - Age 1 719 741
## - Fsize 4 726 742
## - FamilySize_dataSet 1 721 743
## - Pclass 2 799 819
## - WomanChild12_1st 2 984 1004
##
## Step: AIC=737
## Survived ~ Pclass + Age + WomanChild12_1st + Fsize + FamilySize_dataSet
##
## Df Deviance AIC
## <none> 715 737
## - Age 1 720 740
## - Fsize 4 727 741
## - FamilySize_dataSet 1 722 742
## - Pclass 2 800 818
## - WomanChild12_1st 2 991 1009
#Best Model: Age + Pclass + Fsize + FamilySize_dataSet + WomanChild12_1st
bothways = step(fit20.log_N, list(lower=formula(fit20.log_N),upper=formula(fit20.log_C)),
direction="both",trace=0)
summary(forwards)
##
## Call:
## glm(formula = Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet +
## Fsize + Age, family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.450 -0.550 0.399 0.514 2.760
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.43121 0.62221 -7.12 1.1e-12 ***
## WomanChild12_1stWomen 0.35656 0.49401 0.72 0.4704
## WomanChild12_1stMen 3.54989 0.52805 6.72 1.8e-11 ***
## Pclass2nd Class 1.37139 0.28349 4.84 1.3e-06 ***
## Pclass3rd Class 2.29608 0.26415 8.69 < 2e-16 ***
## FamilySize_dataSet 0.46001 0.17729 2.59 0.0095 **
## Fsize2 -0.07101 0.26780 -0.27 0.7909
## Fsize3 -0.36491 0.35246 -1.04 0.3005
## Fsize4 -0.77762 0.64549 -1.20 0.2283
## Fsize5+ 1.30217 0.70647 1.84 0.0653 .
## Age 0.02054 0.00934 2.20 0.0278 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 714.66 on 880 degrees of freedom
## AIC: 736.7
##
## Number of Fisher Scoring iterations: 5
dt.train$pred.forwards <- predict.glm(forwards, newdata = dt.train, type = "response")
dt.train$pred.forwards <- ifelse(dt.train$pred.forwards > 0.5,1,0)
dt.train$pred.forwards <- factor(dt.train$pred.forwards, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.forwards, dt.train$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survived Died
## Survived 252 59
## Died 90 490
##
## Accuracy : 0.833
## 95% CI : (0.807, 0.857)
## No Information Rate : 0.616
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.64
## Mcnemar's Test P-Value : 0.014
##
## Sensitivity : 0.737
## Specificity : 0.893
## Pos Pred Value : 0.810
## Neg Pred Value : 0.845
## Prevalence : 0.384
## Detection Rate : 0.283
## Detection Prevalence : 0.349
## Balanced Accuracy : 0.815
##
## 'Positive' Class : Survived
##
dt.test <- readData(test.data,test.VariableType, missingNA)
dt.test$Pclass <- as.factor(dt.test$Pclass)
levels(dt.test$Pclass) <- c("1st Class", "2nd Class", "3rd Class")
dt.test$Sex <- factor(dt.test$Sex, levels=c("female", "male"))
levels(dt.test$Sex) <- c("Female", "Male")
# Graphs and tables from training and testing datasets
mosaicplot(Pclass ~ Sex,
           data=dt.test, main="Titanic Test Data: Passenger Sex by Class",
color=c("#8dd3c7", "#fb8072"), shade=FALSE, xlab="", ylab="",
off=c(0), cex.axis=1.4)
which(is.na(dt.test$Fare))
## [1] 153
dt.test$Fare[153] <- median(dt.test$Fare, na.rm=TRUE) #impute median of Fare in the test dataset
# Grab title from passenger names
dt.test$Title <- gsub('(.*, )|(\\..*)', '', dt.test$Name)
table(dt.test$Title)
##
## Col Dona Dr Master Miss Mr Mrs Ms Rev
## 2 1 1 21 78 240 72 1 2
options(digits=2)
with(dt.test,bystats(Age, Title,
fun=function(x)c(Mean=mean(x),Median=median(x))))
##
## Mean and Median of Age by Title
##
## N Missing Mean Median
## Col 2 0 50.0 50
## Dona 1 0 39.0 39
## Dr 1 0 53.0 53
## Master 17 4 7.4 7
## Miss 64 14 21.8 22
## Mr 183 57 32.0 28
## Mrs 62 10 38.9 36
## Ms 0 1 NA NA
## Rev 2 0 35.5 36
## ALL 332 86 30.3 27
summary(dt.test$Embarked) #The variable Embarked has no missing values
## C Q S
## 102 46 270
## list of all titles
titles <- c("Mr","Mrs","Miss","Master","Don","Rev",
"Dr","Mme","Ms","Major","Lady","Sir",
"Mlle","Col","Capt","the Countess","Jonkheer","Dona")
dt.test$Age <- imputeMedian(dt.test$Age,dt.test$Title,titles)
dt.test$Age[which(dt.test$Title=="Ms")] <- 36 #Impute the median age of "Mrs" for the single "Ms" passenger
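imputeMedian() is a helper defined earlier in the analysis; a minimal sketch of an equivalent implementation (an assumption, not the original definition) is:

```r
# Sketch of an imputeMedian()-style helper (assumed equivalent to the one
# defined earlier in the analysis): fill missing values of impute.var with
# the median of the non-missing values within each level of filter.var.
imputeMedian <- function(impute.var, filter.var, var.levels) {
  for (v in var.levels) {
    idx <- which(filter.var == v)
    impute.var[idx][is.na(impute.var[idx])] <-
      median(impute.var[idx], na.rm = TRUE)
  }
  impute.var
}
```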
q<-ggplot(dt.test, aes(x=Age, fill=Pclass)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
  labs(title="Titanic Test Data: Age by Class")
q1<-q+scale_fill_manual(name="Class",values=c("blue","green", "pink"))
q2<-q1+scale_color_manual(values=c("blue","green", "pink"))
q2
## assigning a new title value to old title(s)
dt.test$NewTitle[dt.test$Title %in% c("Col","Dr", "Rev")] <- 0 #Note: the "Special" group includes a female doctor
dt.test$NewTitle[dt.test$Title %in% c("Mrs", "Ms","Dona")] <- 1
dt.test$NewTitle[dt.test$Title %in% c("Master")] <- 2
dt.test$NewTitle[dt.test$Title %in% c("Miss", "Mlle")] <- 3
dt.test$NewTitle[dt.test$Title %in% c("Mr", "Sir", "Jonkheer")] <- 4
dt.test$NewTitle <- as.factor(dt.test$NewTitle)
levels(dt.test$NewTitle) <- c("Special", "Mrs", "Master","Miss","Mr")
table(dt.test$NewTitle)
##
## Special Mrs Master Miss Mr
## 5 74 21 78 240
#Because "women and children first" governed lifeboat loading during the disaster,
#we create a variable that separates children, women, and men.
#The variable WomanChild12_1st assumes a child is anyone aged 12 or younger
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Master")] <- 0
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Miss") & dt.test$Age<=12] <- 0
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Miss") & dt.test$Age>12] <- 1
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Mrs")] <- 1
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Special") & dt.test$Sex=="Female"] <- 1 #e.g., the female doctor
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Special") & dt.test$Sex=="Male"] <- 2
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Mr")] <- 2
dt.test$WomanChild12_1st <- as.factor(dt.test$WomanChild12_1st)
levels(dt.test$WomanChild12_1st) <- c("Children", "Women", "Men")
table(dt.test$WomanChild12_1st, dt.test$NewTitle)
##
## Special Mrs Master Miss Mr
## Children 0 0 21 12 0
## Women 0 74 0 66 0
## Men 5 0 0 0 240
#The variable WomanChild14_1st assumes a child is anyone aged 14 or younger
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Master")] <-0
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Miss") & dt.test$Age<=14] <- 0
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Miss") & dt.test$Age>14] <- 1
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Mrs")] <- 1
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Special") & dt.test$Sex=="Female"] <- 1 #e.g., the female doctor
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Special") & dt.test$Sex=="Male"] <- 2
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Mr") & dt.test$Age<=14] <- 0
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Mr") & dt.test$Age>14] <- 2
dt.test$WomanChild14_1st <- as.factor(dt.test$WomanChild14_1st)
levels(dt.test$WomanChild14_1st) <- c("Children", "Women", "Men")
table(dt.test$WomanChild14_1st, dt.test$NewTitle)
##
## Special Mrs Master Miss Mr
## Children 0 0 21 12 2
## Women 0 74 0 66 0
## Men 5 0 0 0 238
q<-ggplot(dt.test, aes(x=Age, fill=WomanChild12_1st)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Titanic Test Data: Survival of Women and Children First code")
q1<-q+scale_fill_manual(name="Women & Children (< 13 years)\nFirst",values=c("green","blue", "pink"))
q2<-q1+scale_color_manual(values=c("green","blue", "pink"))
q2
q<-ggplot(dt.test, aes(x=Age, fill=WomanChild14_1st)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Titanic Test Data: Survival of Women and Children First code")
q1<-q+scale_fill_manual(name="Women & Children (< 15 years)\nFirst",values=c("green","blue", "pink"))
q2<-q1+scale_color_manual(values=c("green","blue", "pink"))
q2
dt.test$FamilySize <- dt.test$SibSp + dt.test$Parch + 1 #Passenger + siblings/spouses +
#parents/children aboard
#From the training-data analysis, we saw:
# FamilySize = 1: passengers traveling alone are more likely to die
# FamilySize = 2, 3 or 4: passengers with 1 to 3 family members are more likely to survive
# FamilySize = 5 or more: passengers with a family size of 5 or more are more likely to die
dt.test$Fsize[dt.test$FamilySize == 1] <- 1
dt.test$Fsize[dt.test$FamilySize == 2] <- 2
dt.test$Fsize[dt.test$FamilySize == 3] <- 3
dt.test$Fsize[dt.test$FamilySize == 4] <- 4
dt.test$Fsize[dt.test$FamilySize >= 5] <- 5
dt.test$Fsize <- as.factor(dt.test$Fsize)
levels(dt.test$Fsize) <- c("1", "2", "3","4","5+")
table(dt.test$Fsize)
##
## 1 2 3 4 5+
## 253 74 57 14 20
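The five Fsize assignments above can be collapsed into a single cut() call with the same breakpoints; a sketch on a toy vector:

```r
# Equivalent one-liner for the Fsize binning above: the right-closed
# intervals (0,1], (1,2], (2,3], (3,4], (4,Inf] map to the same five levels
fs <- c(1, 2, 3, 4, 5, 7, 11)
fsize <- cut(fs, breaks = c(0, 1, 2, 3, 4, Inf),
             labels = c("1", "2", "3", "4", "5+"))
```

In the analysis this would be applied to dt.test$FamilySize instead of the toy vector fs.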
with(dt.test,table(Fsize, Sex))
## Sex
## Fsize Female Male
## 1 68 185
## 2 36 38
## 3 30 27
## 4 10 4
## 5+ 8 12
par(mfrow=c(1,1))
boxplot(Age ~ FamilySize, data =dt.test, xlab="Family Size on the Ship",
ylab="Age (years)", main="Titanic Test Data",col=c(2:8,"pink","orange"))
#Family Name
dt.test$FamilyName <- gsub(",.*$", "", dt.test$Name)
#To create a FamilyID, we paste the family size aboard the Titanic to the
#passenger's surname
dt.test$FamilyID <- paste(as.character(dt.test$FamilySize), dt.test$FamilyName, sep="")
dt.test$FamilyID_Embk_Ticket <- paste(dt.test$FamilyID,dt.test$Embarked, as.character(dt.test$Ticket), sep="_")
dt.test$FamilyID_dataSet <- match(dt.test$FamilyID_Embk_Ticket, unique(dt.test$FamilyID_Embk_Ticket))
dt.test$FamilySize_dataSet <- ave(dt.test$FamilyID_dataSet,dt.test$FamilyID_dataSet, FUN =length)
summary(dt.test$FamilySize_dataSet)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 1.0 1.2 1.0 4.0
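The match()/unique()/ave() construction above is easiest to see on a toy example (the IDs below are made up for illustration):

```r
# Toy illustration of the FamilyID_dataSet / FamilySize_dataSet construction:
# match() against unique() assigns an integer ID per distinct family key, and
# ave(..., FUN = length) counts how many passengers share each ID.
ids  <- c("3Smith_S_101", "3Smith_S_101", "1Jones_C_202")
fam  <- match(ids, unique(ids))       # 1 1 2
size <- ave(fam, fam, FUN = length)   # 2 2 1
```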
plot(dt.test$FamilyID_dataSet, dt.test$FamilySize, xlab="Family ID in the dataset",
ylab="Family Size on the Ship",main= "Titanic Test dataset")
plot(dt.test$FamilySize_dataSet,dt.test$FamilySize, xlab="Family Size in the dataset",
ylab="Family Size on the Ship",main= "Titanic Test dataset")
table(factor(dt.test$FamilySize),factor(dt.test$FamilySize_dataSet))
##
## 1 2 3 4
## 1 253 0 0 0
## 2 54 20 0 0
## 3 32 22 3 0
## 4 6 8 0 0
## 5 4 0 3 0
## 6 1 2 0 0
## 7 1 0 3 0
## 8 0 2 0 0
## 11 0 0 0 4
#Fare
q<-ggplot(dt.test, aes(x=Fare, fill=Pclass)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Titanic Test Data: Fare by Class")
q1<-q+scale_fill_manual(name="Class",values=c("green","blue", "pink"))
q2<-q1+scale_color_manual(values=c("green","blue", "pink"))
q2
with(dt.test, {
boxplot(Fare ~ FamilySize, xlab="Family Size on the Titanic",
ylab="Fare", main="Titanic Test Data", col=c(2:8,"pink","orange"))
})
par(mfrow=c(1,2))
with(dt.test, {
boxplot(Fare ~ Fsize, xlab="Family Size on the Titanic",
ylab="Fare", main="Titanic Test Data", col=2:10)
boxplot(Fare ~ Fsize, xlab="Family Size on the Titanic",
ylab="Fare", main="Titanic Test Data", col=2:10, ylim=c(0,250))
})
q<-ggplot(dt.test, aes(x=Fare, fill=FamilySize)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Titanic Test Data: Fare by Family Size")
q<-ggplot(dt.test, aes(x=Fare, fill=Fsize)) +
geom_histogram(position="identity", alpha=0.5,bins=90) +
labs(title="Titanic Test Data: Fare by Family Size")
set.seed(12345)
#fit19.log <- glm(Survived ~ Pclass + Fsize + WomanChild12_1st , family = binomial(link='logit'), data = dt.train)
dt.test$pred.fit19.log <- predict.glm(fit19.log, newdata = dt.test, type = "response")
dt.test$pred.fit19.log <- ifelse(dt.test$pred.fit19.log > 0.5,0,1)
#Submitting
submit <- data.frame(PassengerId = dt.test$PassengerId, Survived = dt.test$pred.fit19.log)
write.csv(submit, file = "Prediction_model19_logistic.csv", row.names = FALSE)
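Note the direction of the threshold in the recoding above. The glm fits the probability of the second level of the Survived factor, which here corresponds to death (hence the positive coefficients for Men and 3rd Class in the fit19.log summary), while the Kaggle submission needs Survived = 1. A minimal sketch with made-up probabilities:

```r
# The fitted probability is for the "Died" level, so a probability above 0.5
# must map to Survived = 0 in the submission file.
# Made-up probabilities for illustration:
p <- c(0.20, 0.80, 0.51)
survived <- ifelse(p > 0.5, 0, 1)   # 1 0 0
```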
fit13.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + NewTitle, family = binomial(link='logit'), data = dt.train)
dt.test$pred.fit13.log <- predict.glm(fit13.log, newdata = dt.test, type = "response")
dt.test$pred.fit13.log <- ifelse(dt.test$pred.fit13.log > 0.5,0,1)
#Submitting
submit <- data.frame(PassengerId = dt.test$PassengerId, Survived = dt.test$pred.fit13.log)
write.csv(submit, file = "Prediction_model13_logistic.csv", row.names = FALSE)
Using stepwise selection, the best model contains the same variables whether we search forwards, backwards, or in both directions. We now repeat the search with Fare added to the candidate set.
fit20.log_C <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + Fare+
Title + NewTitle + WomanChild12_1st +WomanChild14_1st+
Fsize + FamilySize_dataSet , family = binomial(link='logit'), data = dt.train)
fit20.log_N <- glm(Survived ~ 1 , family = binomial(link='logit'), data = dt.train)
#Best Model: Age + Fare + Pclass + Fsize + FamilySize_dataSet + WomanChild12_1st
forwards = step(fit20.log_N,scope=list(lower=formula(fit20.log_N),upper=formula(fit20.log_C)), direction="forward")
## Start: AIC=1189
## Survived ~ 1
##
## Df Deviance AIC
## + WomanChild12_1st 2 882 888
## + WomanChild14_1st 2 887 893
## + NewTitle 4 883 893
## + Title 16 869 903
## + Sex 1 918 922
## + Pclass 2 1083 1089
## + Fsize 4 1108 1118
## + Fare 1 1118 1122
## + Embarked 2 1161 1167
## + Parch 1 1181 1185
## + Age 1 1181 1185
## <none> 1187 1189
## + FamilySize_dataSet 1 1185 1189
## + SibSp 1 1186 1190
##
## Step: AIC=888
## Survived ~ WomanChild12_1st
##
## Df Deviance AIC
## + Pclass 2 779 789
## + Fsize 4 810 824
## + FamilySize_dataSet 1 826 834
## + SibSp 1 845 853
## + Fare 1 852 860
## + Embarked 2 860 870
## + Parch 1 869 877
## + Age 1 880 888
## <none> 882 888
## + Sex 1 882 890
## + WomanChild14_1st 2 881 891
## + NewTitle 4 879 893
## + Title 16 866 904
##
## Step: AIC=789
## Survived ~ WomanChild12_1st + Pclass
##
## Df Deviance AIC
## + FamilySize_dataSet 1 732 744
## + Fsize 4 727 745
## + SibSp 1 748 760
## + Parch 1 767 779
## + Age 1 773 785
## + Embarked 2 771 785
## <none> 779 789
## + Sex 1 779 791
## + Fare 1 779 791
## + WomanChild14_1st 2 779 793
## + NewTitle 4 778 796
## + Title 16 770 812
##
## Step: AIC=744
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet
##
## Df Deviance AIC
## + Fsize 4 720 740
## + Age 1 727 741
## + Fare 1 729 743
## + Embarked 2 728 744
## <none> 732 744
## + SibSp 1 731 745
## + Parch 1 731 745
## + Sex 1 732 746
## + WomanChild14_1st 2 732 748
## + NewTitle 4 728 748
## + Title 16 721 765
##
## Step: AIC=740
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + Fsize
##
## Df Deviance AIC
## + Age 1 715 737
## + Fare 1 716 738
## <none> 720 740
## + Embarked 2 717 741
## + Parch 1 719 741
## + Sex 1 719 741
## + SibSp 1 719 741
## + WomanChild14_1st 2 719 743
## + NewTitle 4 717 745
## + Title 16 710 762
##
## Step: AIC=737
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + Fsize +
## Age
##
## Df Deviance AIC
## + Fare 1 712 736
## <none> 715 737
## + Parch 1 714 738
## + Sex 1 714 738
## + SibSp 1 714 738
## + Embarked 2 712 738
## + NewTitle 4 710 740
## + WomanChild14_1st 2 714 740
## + Title 16 702 756
##
## Step: AIC=736
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + Fsize +
## Age + Fare
##
## Df Deviance AIC
## <none> 712 736
## + Parch 1 711 737
## + Sex 1 711 737
## + SibSp 1 712 738
## + Embarked 2 710 738
## + NewTitle 4 706 738
## + WomanChild14_1st 2 711 739
## + Title 16 699 755
#Best Model: Age + Fare + Pclass + Fsize + FamilySize_dataSet + WomanChild12_1st
backwards = step(fit20.log_C) # Backwards selection is the default
## Start: AIC=764
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + Fare +
## Title + NewTitle + WomanChild12_1st + WomanChild14_1st +
## Fsize + FamilySize_dataSet
##
##
## Step: AIC=764
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + Fare +
## Title + WomanChild12_1st + WomanChild14_1st + Fsize + FamilySize_dataSet
##
##
## Step: AIC=764
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + Fare + Title +
## WomanChild12_1st + WomanChild14_1st + Fsize + FamilySize_dataSet
##
## Df Deviance AIC
## - Title 16 709 745
## - WomanChild14_1st 2 696 760
## - Embarked 2 698 762
## - SibSp 1 696 762
## - WomanChild12_1st 2 698 762
## - Parch 1 696 762
## <none> 696 764
## - Fsize 4 704 764
## - Fare 1 698 764
## - Age 1 702 768
## - FamilySize_dataSet 1 703 769
## - Pclass 2 736 800
##
## Step: AIC=745
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + Fare + WomanChild12_1st +
## WomanChild14_1st + Fsize + FamilySize_dataSet
##
## Df Deviance AIC
## - WomanChild14_1st 2 709 741
## - Embarked 2 711 743
## - SibSp 1 709 743
## - Parch 1 710 744
## <none> 709 745
## - Fare 1 711 745
## - WomanChild12_1st 2 713 745
## - Age 1 713 747
## - Fsize 4 719 747
## - FamilySize_dataSet 1 716 750
## - Pclass 2 750 782
##
## Step: AIC=741
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + Fare + WomanChild12_1st +
## Fsize + FamilySize_dataSet
##
## Df Deviance AIC
## - Embarked 2 711 739
## - SibSp 1 710 740
## - Parch 1 710 740
## <none> 709 741
## - Fare 1 711 741
## - Age 1 713 743
## - Fsize 4 719 743
## - FamilySize_dataSet 1 716 746
## - Pclass 2 751 779
## - WomanChild12_1st 2 966 994
##
## Step: AIC=739
## Survived ~ Pclass + Age + SibSp + Parch + Fare + WomanChild12_1st +
## Fsize + FamilySize_dataSet
##
## Df Deviance AIC
## - SibSp 1 711 737
## - Parch 1 712 738
## <none> 711 739
## - Fare 1 713 739
## - Age 1 715 741
## - Fsize 4 722 742
## - FamilySize_dataSet 1 718 744
## - Pclass 2 753 777
## - WomanChild12_1st 2 979 1003
##
## Step: AIC=737
## Survived ~ Pclass + Age + Parch + Fare + WomanChild12_1st + Fsize +
## FamilySize_dataSet
##
## Df Deviance AIC
## - Parch 1 712 736
## <none> 711 737
## - Fare 1 714 738
## - Age 1 715 739
## - Fsize 4 723 741
## - FamilySize_dataSet 1 718 742
## - Pclass 2 753 775
## - WomanChild12_1st 2 979 1001
##
## Step: AIC=736
## Survived ~ Pclass + Age + Fare + WomanChild12_1st + Fsize + FamilySize_dataSet
##
## Df Deviance AIC
## <none> 712 736
## - Fare 1 715 737
## - Age 1 716 738
## - Fsize 4 724 740
## - FamilySize_dataSet 1 719 741
## - Pclass 2 754 774
## - WomanChild12_1st 2 985 1005
#Best Model: Age + Fare + Pclass + Fsize + FamilySize_dataSet + WomanChild12_1st
bothways = step(fit20.log_N, list(lower=formula(fit20.log_N),upper=formula(fit20.log_C)),
direction="both",trace=0)
summary(forwards)
##
## Call:
## glm(formula = Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet +
## Fsize + Age + Fare, family = binomial(link = "logit"), data = dt.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.366 -0.535 0.404 0.512 2.863
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.21968 0.64107 -6.58 4.6e-11 ***
## WomanChild12_1stWomen 0.48603 0.50674 0.96 0.33749
## WomanChild12_1stMen 3.67084 0.53981 6.80 1.0e-11 ***
## Pclass2nd Class 1.14081 0.31682 3.60 0.00032 ***
## Pclass3rd Class 2.02139 0.31345 6.45 1.1e-10 ***
## FamilySize_dataSet 0.46468 0.17932 2.59 0.00956 **
## Fsize2 0.00299 0.27355 0.01 0.99129
## Fsize3 -0.26150 0.36034 -0.73 0.46802
## Fsize4 -0.65904 0.65460 -1.01 0.31404
## Fsize5+ 1.53487 0.73665 2.08 0.03720 *
## Age 0.01853 0.00943 1.96 0.04946 *
## Fare -0.00418 0.00267 -1.56 0.11767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 711.77 on 879 degrees of freedom
## AIC: 735.8
##
## Number of Fisher Scoring iterations: 5
dt.test$pred.forwards <- predict.glm(forwards, newdata = dt.test, type = "response")
dt.test$pred.forwards <- ifelse(dt.test$pred.forwards > 0.5,0,1)
#Submitting
submit <- data.frame(PassengerId = dt.test$PassengerId, Survived = dt.test$pred.forwards)
write.csv(submit, file = "Prediction_model21_StepF_logistic.csv", row.names = FALSE)
fit9.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize, data=dt.train, method="class")
dt.test$pred.fit9.dt <- predict(fit9.dt, newdata=dt.test,type='class')
#The prediction is a factor with levels "Survived"/"Died"; assigning "0"/"1"
#into the factor directly raises invalid-factor-level warnings and generates
#NAs, so recode it to the numeric 0/1 form required for submission instead
dt.test$pred.fit9.dt <- ifelse(dt.test$pred.fit9.dt=="Survived",1,0)
#Submitting
submit <- data.frame(PassengerId = dt.test$PassengerId, Survived = dt.test$pred.fit9.dt)
write.csv(submit, file = "Prediction_model9_DecisionTree.csv", row.names = FALSE)
fit6.rf <- randomForest(Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked,
data=dt.train, importance=TRUE, ntree=2000)
dt.test$pred.fit6.rf <- predict(fit6.rf, newdata=dt.test,type='class')
#Recode the factor prediction to the numeric 0/1 form required for submission
dt.test$pred.fit6.randomf <- ifelse(dt.test$pred.fit6.rf=="Survived",1,0)
#Submitting
submit <- data.frame(PassengerId = dt.test$PassengerId, Survived = dt.test$pred.fit6.randomf)
write.csv(submit, file = "Prediction_model6_RandomForest.csv", row.names = FALSE)